Behrooz Parhami's ECE 257A Course Page for Fall 2023

Fault-Tolerant Computing

Page last updated on 2023 October 21

Enrollment code: 13375
Prerequisite: ECE 154 (computer architecture), or equivalent
Class meetings: MW 10-11, ILP 4105 (inverted classroom)
Instructor: Professor Behrooz Parhami
Open office hours: MW 11-11:50, ILP 4105
Course announcements: Listed in reverse chronological order
Course calendar: Lecture, homework, and exam schedules
Homework assignments: Four assignments, worth a total of 40%
Exams: None for fall 2023
Research paper: Report 50%; Poster 10%
Research paper guidelines: Brief guide to format and contents
Poster presentation tips: Brief guide to format and structure
Policy on academic integrity: Please read very carefully
Grades: Statistics for homework and other grades
References: Textbook and other sources (Textbook's web page)
Lecture slides: Via the textbook's Web page
Miscellaneous information: Motivation, catalog entry, history

Course Announcements

2023/10/21: The course ECE 257A has been cancelled for fall 2023. Therefore, this page will no longer be updated. I will maintain my MW 11:00-11:50 office hours in ILP 4105 for the rest of the fall quarter. Please check back here in early summer 2024 to learn about the next offering of the course, which will likely be during fall quarter 2024.
2023/10/01: HW1 has been posted to the homework area below. Please watch Lecture 1 before our first class on Monday 10/02.
2023/09/21: In view of rising numbers of COVID-19 cases, UCSB campus guidelines recommend frequent hand-washing, optional use of face masks, and testing when exposed or showing symptoms.
2023/06/02: Welcome to the ECE 257A Web page for fall 2023. As of today, enrollment stands at 4. The course will be research-based, with 60% of your grade determined by your research report & poster and 40% based on homework.
I will use an inverted classroom model. Video of each lecture must be watched before the scheduled date on the course calendar. The first hour of our in-person class meeting will be devoted to discussion and Q&A on the topic, with the following 50 minutes serving as an open office hour held in the same classroom. Students will be free to leave after the one-hour discussion session.

Course Calendar

Course lectures, homework assignments, and research paper deadlines have been scheduled as follows. This schedule will be strictly observed. In particular, no extension is possible for homework due dates. Please begin work on your assignments early. Each lecture corresponds to topics in 1-2 chapters of the instructor's forthcoming textbook on dependable computing. Chapter numbers are provided in parentheses, after day & date.

Day & Date (book chapters) Lecture topic [Homework posted/due] {Special notes}
M 10/02 (1) Background and motivation [HW1 posted, chs. 1-4] {Lec. 1}
W 10/04 (2) Dependability attributes {Lec. 2}

M 10/09 (3) Combinational modeling {Lec. 3}
W 10/11 (4) State-space modeling {Lec. 4}

M 10/16 Special presentation on research topics for fall 2023 [HW1 due] {Research topics defined}
W 10/18 (5, 7) Defect avoidance; Shielding and hardening {Lec. 5}

M 10/23 No lecture or class (instructor away)
W 10/25 No lecture or class (instructor away) [HW2 posted, chs. 5-12]

M 10/30 (6, 8) Defect circumvention; Yield enhancement {Research topic preferences due} {Lec. 6}
W 11/01 (9, 11) Fault testing; Design for testability {Research topics assigned} {Lec. 7}

M 11/06 (10, 12) Fault masking; Replication with voting {Lec. 8}
W 11/08 (13, 15) Error detection; Self-checking modules [HW2 due] {Lec. 9}

M 11/13 (14, 16) Error correction; RAID systems [HW3 posted, chs. 13-20] {Prelim. ref's due} {Lec. 10}
W 11/15 (17, 19) Malfunction diagnosis; Standby redundancy {Lec. 11}

M 11/20 (18, 20) Malfunction tolerance; Robust parallel processing {Lec. 12}
W 11/22 No lecture or class (research time allowance) [HW4 posted, chs. 21-28]

M 11/27 (21, 23) Degradation allowance; Resilient algorithms [HW3 due] {Lec. 13}
W 11/29 (22, 24) Degradation mgmt; SW redundancy {Ref's & provisional abst. due} {Lec. 14}

M 12/04 (25, 27) Failure confinement; Agreement and adjudication {Lec. 15} {Optional poster submission}
W 12/06 (26, 28) Failure recovery; Fail-safe systems [HW4 due] {Lec. 16}

W 12/13 {Full research paper & poster PDF files due by midnight}
W 12/20 {Course grades due by midnight}

Homework Assignments

- Turn in your solutions as a PDF file attached to an e-mail sent by the due date/time.
- Because solutions will be handed out on the due date, no extension can be granted.
- Include your name, course name, and assignment number at the top of the first page.
- If homework is handwritten and scanned, make sure that the PDF is clean and legible.
- Although some cooperation is permitted, direct copying will have severe consequences.

Homework 1: Dependability and its modeling (chs. 1-4, due M 2023/10/16, 10:00 AM)
Do the following problems from the textbook: 1.16, 1.27, 2.22, 3.19, 4.4, 4.15

Homework 2: Defects and faults (chs. 5-12, due W 2023/11/08, 10:00 AM)
Do the following problems from the textbook: To be posted here no later than W 10/25

Homework 3: Errors and malfunctions (chs. 13-20, due M 2023/11/27, 10:00 AM)
Do the following problems from the textbook: To be posted here no later than M 11/13

Homework 4: Degradations and failures (chs. 21-28, due W 2023/12/06, 10:00 AM)
Do the following problems from the textbook: To be posted here no later than W 11/22

Sample Exams and Study Guide (does not apply to fall 2023)

The following sample exam problems are meant to indicate the types and levels of problems, rather than the coverage (which is outlined in the course calendar).
Students are responsible for all sections and topics in the textbook and class handouts that are not explicitly excluded in the study guide that follows each sample exam, even if the material was not covered in class lectures.

Sample Midterm Exam (105 minutes)
Problems 3.12, 4.4, 9.4, and 12.1 from the textbook.

Midterm Exam Study Guide
Study Chapters 1-12 and review the problems in homework assignments 1-2. The following textbook sections are excluded: 6.6, 7.6, 8.6, 9.4, 9.6, 11.6

Sample Final Exam (120 minutes)
Problems 15.5, 17.1, 21.2, and 27.3 from the textbook.

Final Exam Study Guide
Study Chapters 13-28 and review the problems in homework assignments 3-4. The following textbook sections are excluded: 13.6, 14.6

Research Paper and Presentation

Each student will review a subfield of dependable computing or do original research on a selected and approved topic. A list of pre-approved research topics is provided below. However, students should feel free to propose their own topics for approval. To propose a topic, send via e-mail a one-page narrative, including 2-3 key references, to the instructor.

A publishable report earns an "A" for the course, regardless of homework grades. See the course calendar for schedule & due dates and Research Paper Guidelines for formatting tips.

The following parts of the Research section have not yet been updated for fall 2023

Our research for fall 2022 will focus on fault tolerance and robustness in biological systems, whose attributes may allow us to build ultra-reliable biologically-inspired systems. A side benefit of biologically-inspired systems is low power consumption. The following are titles and starting references for individual research papers.

01. Biologically-inspired Methods of Self-Repair [Assigned to: Jiachen Zhang]
Self-repair is any method that allows a system to automatically return to full or at least better functionality after an undesirable event has "injured" it.
Stauffer, A., Mange, D., & Tempesti, G. (2006, January). Bio-inspired computing machines with self-repair mechanisms. In International Workshop on Biologically Inspired Approaches to Advanced Information Technology (pp. 128-140). Springer, Berlin, Heidelberg.
Samie, M., Dragffy, G., & Pipe, T. (2009, July). Novel bio-inspired self-repair algorithm for evolvable fault tolerant hardware systems. In Proceedings of the 11th Annual Conference Companion on Genetic and Evolutionary Computation Conference: Late Breaking Papers (pp. 2143-2148).

02. Trade-offs Between Efficiency and Robustness in Biological Systems
We know that efficiency optimizations in computer systems are done at the expense of robustness. To what extent is the same true in biological systems?
Vardi, M. (2020), A Computational Lens on Economics, CACM. https://cacm.acm.org/magazines/2020/7/245686-a-computational-lens-on-economics/fulltext
Carlson, J. M., & Doyle, J. (2002). Complexity and robustness. Proceedings of the national academy of sciences, 99(suppl_1), 2538-2545.

03. Robust Computation in Biological Systems [Assigned to: Kevin Yuen]
A computation is robust if its quality is not affected by minor perturbations in system resources or data. How is this desirable property achieved in biological systems?
Kitano, H. (2007). Towards a theory of biological robustness. Molecular systems biology, 3(1), 137.
Krakauer, D. C. (2006). Robustness in Biological Systems: a provisional taxonomy. In Complex systems science in biomedicine (pp. 183-205). Springer, Boston, MA.

04. Approximation Schemes in Biological Systems [Assigned to: Ian Wu]
Biological computations are either analog or low-precision. How do these properties affect the accuracy of results and how are the ensuing inaccuracies tolerated?
Hopfield, J. J. (1994). Physics, computation, and why biology looks so different. Journal of Theoretical Biology, 171(1), 53-60.
Chelly Dagdia, Z., Avdeyev, P., & Bayzid, M. (2021). Biological computation and computational biology: survey, challenges, and discussion. Artificial Intelligence Review, 54(6), 4169-4235.

05. Genetic Redundancy and Its Benefits [Assigned to: Henry Chang]
Redundancy is one of the most important methods of ensuring dependability. Nature too uses redundancy. One example is redundancy in genes. Try to relate the two redundancy methods and draw conclusions.
Nowak, M. A., Boerlijst, M. C., Cooke, J., & Smith, J. M. (1997). Evolution of genetic redundancy. Nature, 388(6638), 167-171.
Laruson, A. J., Yeaman, S., & Lotterhos, K. E. (2020). The importance of genetic redundancy in evolution. Trends in ecology & evolution, 35(9), 809-822.

06. The Role of Redundancy in the Human Nervous System [Assigned to: Rahul Varghese]
Studies of brains with various kinds of damage show that many essential functions are still performed, either by using the brain's natural redundancy or by remapping functions from one region to another.
Mizusaki, B. E., & O'Donnell, C. (2021). Neural circuit function redundancy in brain disorders. Current opinion in neurobiology, 70, 74-80.
Neilson, P. D., & Neilson, M. D. (2005). An overview of adaptive model theory: solving the problems of redundancy, resources, and nonlinear interactions in human movement control. Journal of neural engineering, 2(3), S279.

07. Regeneration and Self-Repair in Biological Systems [Assigned to: Dainong Hu]
Most cells can repair injuries inflicted on them by various sources. Some creatures are capable of regenerating lost organs. These are examples of self-repair without external assistance.
Yang, I., Jung, S. H., & Cho, K. H. (2016). Self-repairing digital system based on state attractor convergence inspired by the recovery process of a living cell. IEEE Trans. VLSI Systems, 25(2), 648-659.
Koop, F. (2022). Scientists map the brain of the axolotl—a unique creature that can create new neurons, ZME Science. https://www.zmescience.com/science/scientists-map-the-brain-of-the-axolotl-a-salamander-that-can-create-new-neurons-05092022/

08. Functional Redundancy in Humans and Other Animals [Assigned to: Ci-Chian Lu]
Redundancy in function is an effective complement to redundancy in resources. If multiple parts can perform the same function, then tasks can be prioritized and re-allocated, even in the absence of redundant resources.
Rosenfeld, J. S. (2002). Logical fallacies in the assessment of functional redundancy. Conservation Biology, 16(3), 837-839.
Biggs, C. R., Yeager, L. A., Bolser, D. G., Bonsell, C., Dichiera, A. M., Hou, Z., ... & Erisman, B. E. (2020). Does functional redundancy affect ecological stability and resilience? A review and meta-analysis. Ecosphere, 11(7), e03184.

09. Use of Repeated Computation and Voting in the Brain's Decision Processes [Assigned to: Yiliang Chen]
We have seen that replication (in space or time) along with voting is an effective method of fault- and malfunction-tolerance. To what extent does the brain use these methods to improve on result correctness?
Bischoff, I., Neuhaus, C., Trautner, P., & Weber, B. (2013). The neuroeconomics of voting: Neural evidence of different sources of utility in voting. Journal of neuroscience, psychology, and economics, 6(4), 215.
Hunt, L. T., & Hayden, B. Y. (2017). A distributed, hierarchical and recurrent framework for reward-based choice. Nature Reviews Neuroscience, 18(3), 172-182.

10. Use of Error Codes in Biological Systems [Assigned to: Shang-Hsun Yang]
Error-detecting and error-correcting codes are ubiquitous in computer and communication systems. How are these codes used in the human brain and other biological systems?
Battail, G. (2019). Error-correcting codes and information in biology. BioSystems, 184, 103987.
Leeson, M. S., & Higgins, M. D. (2012). Forward error correction for molecular communications. Nano Communication Networks, 3(3), 161-167.

11. Reconfiguration and Reprogramming in Biological Systems [Assigned to: Yuxuan Yin]
One way to achieve robustness and longevity is to reconfigure systems around non-functioning parts or to reprogram one part to perform the tasks of another part. How are these methods used in biological systems?
Finc, K., Bonna, K., He, X., Lydon-Staley, D. M., Kuhn, S., Duch, W., & Bassett, D. S. (2020). Dynamic reconfiguration of functional brain networks during working memory training. Nature communications, 11(1), 1-15.
MacArthur, B. D., Ma'ayan, A., & Lemischka, I. R. (2009). Systems biology of stem cell fate and cellular reprogramming. Nature reviews Molecular cell biology, 10(10), 672-681.

12. Self-Healing Biological Cells [Assigned to: Alex Lai]
Most cells can recover from injuries inflicted on them by various sources. What are the biological bases for self-healing and to what extent are they transferable to computer systems?
Ghosh, D., Sharman, R., Rao, H. R., & Upadhyaya, S. (2007). Self-healing systems—survey and synthesis. Decision support systems, 42(4), 2164-2185.
Diesendruck, C. E., Sottos, N. R., Moore, J. S., & White, S. R. (2015). Biomimetic self-healing. Angewandte Chemie International Edition, 54(36), 10428-10447.

13. Self-Healing Materials and Their Biological Bases [Assigned to: Jonghyun Park]
One of the domains where self-healing methods have been used rather successfully is materials science. What are these methods and to what extent are they inspired by biological systems?
Harrington, M. J., Speck, O., Speck, T., Wagner, S., & Weinkamer, R. (2015). Biological archetypes for self-healing materials. Self-healing Materials, 307-344.
Bekas, D. G., Tsirka, K., Baltzis, D., & Paipetis, A. S. (2016). Self-healing materials: A review of advances in materials, evaluation, characterization and monitoring techniques. Composites Part B: Engineering, 87, 92-119.

14. Adaptation Schemes in Biological Systems to Improve Longevity [Assigned to: Sijia Liang]
Besides evolutionary changes that occur rather slowly, other adaptation schemes are at work for improving longevity. What are these adaptation schemes and how can we apply them to computing systems?
Peck, J. R., & Waxman, D. (2018). What is adaptation and how should it be measured? Journal of Theoretical Biology, 447, 190-198.
Gozhenko, A., Biryukov, V., Muszkieta, R., & Zukow, W. (2018). Physiological basis of human longevity: the concept of a cascade of human aging mechanism. Collegium antropologicum, 42(2), 139-146.

15. Redundant Signaling in Biological Systems [Assigned to: Cathy Geng]
Redundant signaling in the form of error-detecting and error-correcting codes has long been used in computer communications. Do biological systems use similar or vastly different methods?
Teng, K. K., & Hempstead, B. L. (2004). Neurotrophins and their receptors: signaling trios in complex biological systems. Cellular and Molecular Life Sciences, 61(1), 35-48.
Zimmermann, M. (1989). The nervous system in the context of information theory. In Human physiology (pp. 166-173). Springer, Berlin, Heidelberg.

16. Robust Information Storage and Retrieval in Biological Systems [Assigned to: Shu-Yu Li]
Correct storage of data and correct retrieval of what is stored are important in ensuring correct operation of an information system. How are these critical properties achieved in biological systems?
Yim, S. S., McBee, R. M., Song, A. M., Huang, Y., Sheth, R. U., & Wang, H. H. (2021). Robust direct digital-to-biological data storage in living cells. Nature chemical biology, 17(3), 246-253.
Yim, A. K. Y., Yu, A. C. S., Li, J. W., Wong, A. I. C., Loo, J. F., Chan, K. M., ... & Chan, T. F. (2014). The essential component in DNA-based information storage system: robust error-tolerating module. Frontiers in bioengineering and biotechnology, 2, 49.
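
Topic 09 in the list above refers to replication with voting as a method of fault and malfunction tolerance. The following is a minimal Python sketch of that idea, not taken from the textbook or lecture slides. The names majority_vote, run_replicated, good_square, and faulty_square are hypothetical, chosen for illustration only. Three replicas compute the same function, one of them is assumed to be faulty, and a strict-majority voter masks the wrong result.

```python
from collections import Counter
from typing import Callable, Hashable, Optional, Sequence


def majority_vote(results: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value produced by a strict majority of replicas, or None."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    # A strict majority means more than half of all replicas agree.
    return value if count > len(results) // 2 else None


def run_replicated(replicas: Sequence[Callable[[int], int]], x: int) -> Optional[int]:
    """Apply every replica to the same input and vote on the outputs."""
    return majority_vote([replica(x) for replica in replicas])


def good_square(x: int) -> int:
    """A correctly working replica (hypothetical example computation)."""
    return x * x


def faulty_square(x: int) -> int:
    """A replica with an injected fault, assumed here for illustration."""
    return x * x + 1


if __name__ == "__main__":
    # Two good replicas outvote the single faulty one.
    print(run_replicated([good_square, good_square, faulty_square], 7))
```

With three replicas, the voter tolerates one wrong result. If no value reaches a strict majority, the sketch returns None, leaving it to the caller to decide how to handle the disagreement.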

Poster Presentation Tips

Here are some guidelines for preparing your research poster. The idea of the poster is to present your research results and conclusions thus far, get oral feedback during the session from the instructor and your peers, and provide the instructor with something to comment on before your final report is due. Please send a PDF copy of the poster via e-mail by midnight on the poster presentation day.

Posters prepared for conferences must be colorful and eye-catching, as they are typically competing with dozens of other posters for the attendees' attention. Here is an example of a conference poster. Such posters are often mounted on a colored cardboard base, even if the pages themselves are standard PowerPoint slides. In our case, you should aim for a "plain" poster (loose sheets, to be taped to the wall in our classroom) that conveys your message in a simple and direct way. Eight to 10 pages, each resembling a PowerPoint slide, would be an appropriate goal. You can organize the pages into a 2 x 4 (2 columns, 4 rows), 2 x 5, or 3 x 3 array on the wall. The top two of these might contain the project title, your name, course name and number, and a very short (50-word) abstract. The final two can perhaps contain your conclusions and directions for further work (including work that does not appear in the poster, but will be included in your research report). The rest will contain brief descriptions of ideas, with emphasis on diagrams, graphs, tables, and the like, rather than text, which is very difficult for a visitor to absorb in a very limited time span.

Grade Statistics

All grades listed are in percent, unless otherwise noted.
HW1 grades (letter): Range = [L, H], Mean = 0.00, Median = M
HW2 grades (letter): Range = [L, H], Mean = 0.00, Median = M
HW3 grades (letter): Range = [L, H], Mean = 0.00, Median = M
HW4 grades (letter): Range = [L, H], Mean = 0.00, Median = M
Overall homework grades: Range = [00, 00], Mean = 00, Median = 00
Research grades (letter): Range = [L, H], Mean = 0.00, Median = M
Research grades: Range = [00, 00], Mean = 00, Median = 00
Course grades (letter): Range = [L, H], Mean = 0.00, Median = M

References

Required text: B. Parhami, Dependable Computing: A Multilevel Approach, chapters will be posted as they are updated. Please visit the textbook's web page for general information. Lecture slides are also available there.
Some useful books (not required):
Koren/Krishna, Fault-Tolerant Systems, Morgan Kaufmann, 2007 (ISBN 0-12-088525-5)
Shooman, Reliability of Computer Systems and Networks, Wiley, 2002 (ISBN 0-471-29342-3)
Siewiorek/Swarz, Reliable Computer Systems, Digital Press, 1992 (ISBN 1-55558-075-0)
Johnson, Design and Analysis of Fault-Tolerant Digital Systems, Addison Wesley, 1989 (ISBN 0-201-07570-9)
Iyer/Kalbarczyk/Nakka, Dependable Computing: Design and Assessment, IEEE Press, 2024 (ISBN 9781118709443)

Research resources:
Proc. IEEE/IFIP Int'l Conf. Dependable Systems and Networks (DSN), formerly known as Fault-Tolerant Computing Symp. (FTCS), annual, since 1971.
IEEE Trans. Dependable and Secure Computing, published since 2004
IEEE Trans. Reliability, published since 1955
IEEE Trans. Computers, published since 1952
UCSB library's electronic journals, collections, and other resources

Miscellaneous Information

Motivation: Dependability concerns are integral parts of engineering design. Ideally, we would like our computer systems to be perfect, always yielding timely and correct results. However, just as bridges collapse and airplanes crash occasionally, so too computer hardware and software cannot be made totally immune to unpredictable behavior. Despite great strides in component reliability and programming methodology, the exponentially increasing complexity of integrated circuits and software systems makes the design of perfect computer systems nearly impossible. In this course, we study the causes of computer system failures (impairments to dependability), techniques for ensuring correct and timely computations despite such impairments, and tools for evaluating the quality of proposed or implemented solutions.

Catalog entry: 257A. Fault-Tolerant Computing. (4) PARHAMI. Prerequisites: ECE 154. Lecture, 3 hours. Basic concepts of dependable computing. Reliability of nonredundant and redundant systems. Dealing with circuit-level defects. Logic-level fault testing and tolerance. Error detection and correction. Diagnosis and reconfiguration for system-level malfunctions. Degradation management. Failure modeling and risk assessment.

History: Professor Parhami took over the teaching of ECE 257A in the fall quarter of 1998. Previously, the course had been taught primarily by Dr. John Kelly, who instituted the two-course sequence ECE 257A/B, the first covering general topics and the second (now discontinued) devoted to his research focus on software fault tolerance. Borrowing from his experience in teaching dependable computing at other universities and based on an extensive survey of the field that he published in 1994, Professor Parhami oriented the course toward an original multilevel view of impairments to computer system dependability and techniques for avoiding or tolerating them. The levels of this model, in increasing order of abstraction, are: defects, faults, errors, malfunctions, degradations, and failures. A textbook based on this multilevel model of dependable computing is in preparation.
Offering of ECE 257A in fall 2022
Offering of ECE 257A in fall 2021
Offering of ECE 257A in fall 2020
Offering of ECE 257A in fall 2019
Offering of ECE 257A in fall 2018
Offering of ECE 257A in fall 2016 (PDF file)
Offering of ECE 257A in fall 2015 (PDF file)
Offering of ECE 257A in winter 2015 (PDF file)
Offering of ECE 257A in fall 2013 (PDF file)
Offering of ECE 257A in fall 2012 (PDF file)
Offering of ECE 257A in fall 2009 (PDF file)
Offering of ECE 257A in fall 2007 (PDF file)
Offerings of ECE 257A in 1998 and 2006 (PDF file)