Page last updated on 2015 March 15

*Enrollment code:* 54205

*Prerequisite:* ECE 154 (or equivalent)

*Class meetings:* MW 10:00-11:30, Phelps 1431

*Instructor:* Professor Behrooz Parhami

*Open office hours:* MW 12:30-2:00; HFH 5155

**Course announcements:** Listed in reverse chronological order

**Course calendar:** Lecture, homework, and exam schedules

**Homework assignments:** Four assignments, worth a total of 30%

**Exams:** Open-book midterm, worth 30%, and final, worth 40%

**Research paper:** Not applicable to winter 2015

**Research paper guidelines:** Brief guide to format and contents

**Poster presentation tips:** Brief guide to format and structure

**Policy on academic integrity:** Please read very carefully

**Grades:** Statistics for homework and exam grades

**References:** Textbook and other sources (Textbook's web page)

**Lecture slides:** Via the textbook's Web page

**Miscellaneous information:** Motivation, catalog entry, history

**2015/03/11:** Slightly updated versions of the slides and chapters for part VII of the textbook have been uploaded to the book's Web page. Remember that the final exam begins at 8:00 AM on Monday 3/16.
**2015/03/04:** Correction to the textbook: The first solved example in Section 21.6 contains an error. The optimal value of *q* should be 59 (that is, 58 checkpoints) and the resulting total running time should be 211 hr. Problem 21.1 must also be modified to ask about the source of the additional 4 hours in the estimated running time. [Thanks to Zach Wells and Adam Toth for uncovering this error.]
**2015/02/28:** Homework 4 has been posted to the homework area below. Updated slides and chapters for parts VI and VII of the textbook have also been posted. Please make an effort to attend an interesting talk by Leslie Lamport entitled "Who Builds a Skyscraper without Drawing Blueprints" (UCSB Campus, ESB 1001, 11:00 AM – 12:00 PM, Friday March 13, 2015). In case you can't attend, here is a previous delivery of the same talk on YouTube. I will hold extra office hours on F 3/13, 9:00-10:30 AM, in view of our final exam being held on M 3/16.
**2015/02/13:** Homework 3 has been posted to the homework area below. Updated slides and chapters for part V of the textbook have also been posted. Stats for midterm exam grades will be posted no later than Sat. 2/14.
**2015/02/03:** Updated slides and chapters for part IV of the textbook have been posted to the book's Web page. Remember that for the midterm and final exams, you need to study any material not specifically excluded in the respective study guides, even if we did not have time to cover the topics in class.
**2015/01/30:** Updated slides and chapters for part III of the textbook have been posted to the book's Web page. The midterm study guide has also been updated for winter 2015.
**2015/01/24:** Homework 2 has been posted to the homework area below. Updated slides and chapters for part II of the textbook have also been posted.
**2015/01/09:** Homework 1 has been posted to the homework area below. Updated slides and chapters for part I of the textbook have also been posted.
**2015/01/04:** Welcome to the ECE 257A web page for winter 2015. Thus far, 15 students have signed up for the class and I look forward to meeting you all on Monday 1/05. The following information must be considered tentative at this time. Details will be finalized in the first week of classes and updated regularly thereafter. I will be updating and improving the on-line course textbook and lecture slides as we go through the winter quarter. Please pay attention to the associated posting date when downloading material for the course.

Course lectures, homework assignments, and exams have been scheduled as follows. This schedule will be strictly observed. In particular, no extension is possible for homework due dates. Please begin work on your assignments early. Each lecture corresponds to topics in 1-2 chapters of the instructor's forthcoming textbook on dependable computing. Chapter numbers are provided in parentheses, after day & date.

**Day & Date (book chapters) Lecture topic [Homework posted/due] {Special notes}**

M 01/05 (0-1) Background and motivation

W 01/07 (1-2) Dependability attributes

M 01/12 (3) Combinational modeling [HW1 posted, chs. 1-4]

W 01/14 (4) State-space modeling

M 01/19 MLK Birthday observed; no lecture

W 01/21 (5, 7) Defect avoidance; Shielding and hardening [HW1 due]

M 01/26 (6, 8) Defect circumvention; Yield enhancement [HW2 posted, chs. 5-12]

W 01/28 (9, 11) Fault testing; Design for testability

M 02/02 (10, 12) Fault masking; Replication with voting

W 02/04 (13, 15) Error detection; Self-checking modules [HW2 due]

M 02/09 (14, 16) Error correction; Redundant disk arrays

W 02/11 (1-12) Midterm exam, open-book/notes, 10:00-11:45 (note the extended time)

M 02/16 Presidents Day holiday; no lecture [HW3 posted, chs. 13-20]

W 02/18 (17, 19) Malfunction diagnosis; Standby redundancy

M 02/23 (18, 20) Malfunction tolerance; Robust parallel processing

W 02/25 (21, 23) Degradation allowance; Resilient algorithms [HW3 due]

M 03/02 (22, 24) Degradation management; Software redundancy [HW4 posted, chs. 21-28]

W 03/04 (25, 27) Failure confinement; Agreement and adjudication

M 03/09 (26, 28) Failure recovery; Fail-safe systems {Instructor and course evaluations}

W 03/11 Catching up, and review of current research in the field [HW4 due]

M 03/16 (13-28) Final exam, open-book/notes, 9:00-11:00

T 03/24 {Course grades due by midnight}

- Turn in solutions in class before the lecture begins.

- Because solutions will be handed out on the due date, no extension can be granted.

- Use a cover page that includes your name, course name, and assignment number.

- Staple the sheets and write your name on top of each sheet in case they are separated.

- Although some cooperation is permitted, direct copying will have severe consequences.

**Homework 1: Dependability and its modeling** (ch. 1-4, due W 2015/01/21, 10:00 AM)

Do the following problems from the textbook: 1.4, 1.21, 2.22, 3.19, 4.8

**Homework 2: Defects and faults** (ch. 5-12, due W 2015/02/04, 10:00 AM)

Do the following problems from the textbook: 5.2, 7.3, 8.2, 9.6, 10.6, 11.6ab

**Homework 3: Errors and malfunctions** (ch. 13-20, due W 2015/02/25, 10:00 AM)

Do the following problems from the textbook or defined below: 13.1, 14.9, 16.5, 17.9, 18.3ab, 19.1

[Gao15] Gao, Z., P. Reviriego, W. Pan, Z. Xu, M. Zhao, J. Wang, and J. A. Maestro, "Fault Tolerant Parallel Filters Based on Error Correction Codes,"

a. Consider a disk array with

b. What is the reliability of the disk array of part a over a 1-year period?

c. Consider a disk array with

d. What is the reliability of the disk array of part c over a 1-year period?

e. For

f. For

**Homework 4: Degradations and failures** (ch. 21-28, due W 2015/03/11, 10:00 AM)

Do the following problems from the textbook: 21.2, 22.2, 24.6, 26.1, 27.6

The following sample exam problems are meant to indicate the types and levels of problems, rather than the coverage (which is outlined in the course calendar).

Students are responsible for all sections and topics in the textbook and class handouts that are not explicitly excluded in the study guide that follows each sample exam, even if the material was not covered in class lectures.

*Sample Midterm Exam (105 minutes)*

Problems 3.12, 4.4, 9.4, and 12.1 from the textbook.

*Midterm Exam Study Guide*

Study Chapters 1-12 and review the problems in homework assignments 1-2. The following textbook sections are excluded: 6.6, 7.6, 8.6, 9.4, 9.6, 11.6

*Sample Final Exam (120 minutes)*

Problems 15.5, 17.1, 21.2, and 27.3 from the textbook.

*Final Exam Study Guide*

Study Chapters 13-28 and review the problems in homework assignments 3-4. The following textbook sections are excluded: 13.6, 14.6, others TBD

[*Not applicable to the winter 2015 offering.*] Each student will review a subfield of dependable computing or do original research on a selected and approved topic. A preliminary list of research topics is provided below (new topics, and new references for the current topics, may be added later). However, students should feel free to propose their own topics for approval. To propose a topic, send via e-mail a one-page narrative, including 2-3 key references, to the instructor.

A publishable report earns an "A" for the course, regardless of homework and midterm grades. See the course calendar for schedule and due dates and Research Paper Guidelines for formatting tips.

This year's suggested research topics for ECE 257A are built around the theme "Robustness of Interconnection Networks." You can get started on each topic by taking a look at the following two common references, plus one topic-specific reference that is provided further down on this page. The two common references are:

[Parh10] Parhami, B., "Robustness Attributes of Interconnection Networks for Parallel Processing," Keynote Lecture at the First Int'l Supercomputing Conf., Guadalajara, Mexico, March 2010. {PPT and PDF slides are available from B. Parhami's Publications Web page; see publication [262].}

[Sall12] Salles, R. M. and D. A. Marion Jr., "Strategies and Metric for Resilience in Computer Networks," *Computer J.*, Vol. 55, No. 6, pp. 728-739, June 2012.

1. Effects of Missing Nodes on Network Diameter and Average Distance (Assigned to: TBD)

[Kris87] Krishnamoorthy, M.S. and B. Krishnamurthy, "Fault Diameter of Interconnection Networks," *Computers & Mathematics with Applications*, Vol. 13, Nos. 5/6, pp. 577-582, 1987.

2. Effects of Missing Links on Network Diameter and Average Distance (Assigned to: TBD)

[Kris87] Krishnamoorthy, M.S. and B. Krishnamurthy, "Fault Diameter of Interconnection Networks," *Computers & Mathematics with Applications*, Vol. 13, Nos. 5/6, pp. 577-582, 1987.

3. Synthesis of Interconnection Networks with Maximal Fault Tolerance (Assigned to: TBD)

[Chen09] W. Chen, W. J. Xiao, and B. Parhami, "Swapped (OTIS) Networks Built of Connected Basis Networks are Maximally Fault Tolerant," *IEEE Trans. Parallel and Distributed Systems*, Vol. 20, pp. 361-366, March 2009.

4. Adaptive Schemes for Point-to-Point Communication in Networks (Assigned to: TBD)

[Ngai91] Ngai, J. Y. and C. L. Seitz, "A Framework for Adaptive Routing in Multicomputer Networks," *Computer Architecture News*, Vol. 19, No. 1, pp. 6-14, March 1991.

5. Adaptive Schemes for Collective Communication in Networks (Assigned to: TBD)

[Pand95] Panda, D. K., "Issues in Designing Efficient and Practical Algorithms for Collective Communication on Wormhole-Routed Systems," *Proc. Int'l Conf. Parallel Processing Workshop on Challenges for Parallel Processing*, 1995, pp. 8-15.

6. Deadlocks in Adaptive Routing and How to Avoid or Detect Them (Assigned to: TBD)

[Dall93] Dally, W. J. and H. Aoki, "Deadlock-Free Adaptive Routing in Multicomputer Networks Using Virtual Channels," *IEEE Trans. Parallel and Distributed Systems*, Vol. 4, No. 4, pp. 466-475, April 1993.

7. Diagnosability of Regular Degree-*d* Interconnection Networks (Assigned to: TBD)

[Chan05] Chang, G.-Y., G. J. Chang, and G.-H. Chen, "Diagnosabilities of Regular Networks," *IEEE Trans. Parallel and Distributed Systems*, Vol. 16, No. 4, pp. 314-323, April 2005.

8. Diagnosability of Hierarchical or Multilevel Interconnection Networks (Assigned to: TBD)

[Xu09] Xu, M., K. Thulasiraman, and X.-D. Hu, "Conditional Diagnosability of Matching Composition Networks Under the PMC Model," *IEEE Trans. Circuits and Systems II*, Vol. 56, No. 11, pp. 875-879, November 2009.

9. Synthesis of Interconnection Networks with Maximal Diagnosability (Assigned to: TBD)

[Chan05] Chang, G.-Y., G. J. Chang, and G.-H. Chen, "Diagnosabilities of Regular Networks," *IEEE Trans. Parallel and Distributed Systems*, Vol. 16, No. 4, pp. 314-323, April 2005.

*Topics outside the main theme for the quarter*

10. Software Fault Monitoring (Assigned to: TBD)

[Delg04] Delgado, N., A. Q. Gates, and S. Roach, "A Taxonomy and Catalog of Runtime Software-Fault Monitoring Tools," *IEEE Trans. Software Engineering*, Vol. 30, No. 12, pp. 859-872, December 2004.

Here are some guidelines for preparing your research poster. The idea of the poster is to present your research results and conclusions thus far, to get oral feedback during the session from the instructor and your peers, and to provide the instructor with something to comment on before your final report is due. Please send a PDF copy of the poster via e-mail by midnight on the poster presentation day.

Posters prepared for conferences must be colorful and eye-catching, as they are typically competing with dozens of other posters for the attendees' attention. Here is an example of a conference poster. Such posters are often mounted on a colored cardboard base, even if the pages themselves are standard PowerPoint slides. In our case, you should aim for a "plain" poster (loose sheets, to be taped to the wall in our classroom) that conveys your message in a simple and direct way. Eight to 10 pages, each resembling a PowerPoint slide, would be an appropriate goal. You can organize the pages into a 2 x 4 (2 columns, 4 rows), 2 x 5, or 3 x 3 array on the wall. The top two of these might contain the project title, your name, course name and number, and a very short (50-word) abstract. The final two can perhaps contain your conclusions and directions for further work (including work that does not appear in the poster, but will be included in your research report). The rest will contain brief descriptions of ideas, with emphasis on diagrams, graphs, tables, and the like, rather than text, which is very difficult for a visitor to absorb in a very limited time span.

*All grades listed are in percent, unless otherwise noted*.

HW1 grades: Range = [65, 96], Mean = 84, Median = 88

HW2 grades: Range = [60, 91], Mean = 79, Median = 79

HW3 grades: Range = [72, 100], Mean = 83, Median = 81

HW4 grades: Range = [75, 94], Mean = 85, Median = 82

Midterm exam grades: Range = [50, 98], Mean = 74, Median = 75

Final exam grades: Range = [00, 00], Mean = 00, Median = 00

Course grades, A-F: Range = [0.0, 0.0], Mean = 0.0, Median = 0.0
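As a rough illustration of how the weights stated on this page (homework 30%, midterm 30%, final 40%) combine into a course percentage, here is a minimal Python sketch. It assumes the four homework assignments count equally, which this page does not state. The function name and the sample scores are hypothetical, and the mapping from percentages to letter grades is not specified here, so it is left out.

```python
def course_percentage(homework, midterm, final):
    """Combine percent scores using the 30/30/40 weights listed on this page."""
    if len(homework) != 4:
        raise ValueError("expected four homework scores")
    # Assumption: each of the four assignments carries equal weight.
    hw_avg = sum(homework) / len(homework)
    return 0.30 * hw_avg + 0.30 * midterm + 0.40 * final


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    total = course_percentage([80, 90, 70, 100], midterm=75, final=85)
    print(f"Weighted course percentage: {total:.1f}")
```

Because the final exam carries the largest single weight (40%), a given change in the final exam score moves the total more than the same change in the midterm or the homework average.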

**Required text:** B. Parhami,

Koren/Krishna,

Shooman,

Siewiorek/Swarz,

Johnson,

*Research resources:*

*Proc. IEEE/IFIP Int'l Conf. Dependable Systems and Networks* (DSN), formerly known as Fault-Tolerant Computing Symp. (FTCS), annual, since 1971.

*IEEE Trans. Dependable and Secure Computing*, quarterly journal, published since 2004.

*IEEE Trans. Reliability*, quarterly journal, published since 1955.

*IEEE Trans. Computers*, monthly journal, published since 1952.

UCSB library's electronic journals, collections, and other resources

**Motivation:** Dependability concerns are integral parts of engineering design. Ideally, we would like our computer systems to be perfect, always yielding timely and correct results. However, just as bridges collapse and airplanes crash occasionally, so too computer hardware and software cannot be made totally immune to unpredictable behavior. Despite great strides in component reliability and programming methodology, the exponentially increasing complexity of integrated circuits and software systems makes the design of perfect computer systems nearly impossible. In this course, we study the causes of computer system failures (impairments to dependability), techniques for ensuring correct and timely computations despite such impairments, and tools for evaluating the quality of proposed or implemented solutions.

*Catalog entry:* 257A. Fault-Tolerant Computing. (4) PARHAMI. *Prerequisites: ECE 154. Lecture, 3 hours.* Basic concepts of dependable computing. Reliability of nonredundant and redundant systems. Dealing with circuit-level defects. Logic-level fault testing and tolerance. Error detection and correction. Diagnosis and reconfiguration for system-level malfunctions. Degradation management. Failure modeling and risk assessment.

**History:** Professor Parhami took over the teaching of ECE 257A in the fall quarter of 1998. Previously, the course had been taught primarily by Dr. John Kelly, who instituted the two-course sequence ECE 257A/B, the first covering general topics and the second (now discontinued) devoted to his research focus on software fault tolerance. Borrowing from his experience in teaching dependable computing at other universities and based on an extensive survey of the field that he published in 1994, Professor Parhami oriented the course toward an original multilevel view of impairments to computer system dependability and techniques for avoiding or tolerating them. The levels of this model, in increasing order of abstraction, are: defects, faults, errors, malfunctions, degradations, and failures. A textbook based on this multilevel model of dependable computing is in preparation.

Offering of ECE 257A in fall 2013 (PDF file)

Offering of ECE 257A in fall 2012 (PDF file)

Offering of ECE 257A in fall 2009 (PDF file)

Offering of ECE 257A in fall 2007 (PDF file)

Offerings of ECE 257A in 1998 and 2006 (PDF file)