

Behrooz Parhami's ECE 254B Course Page for Winter 2016


Adv. Computer Architecture: Parallel Processing

Page last updated on 2016 March 15

Enrollment code: 55434
Prerequisite: ECE 254A (can be waived, but ECE 154 is required)
Class meetings: MW 10:00-11:30, Phelps 1431
Instructor: Professor Behrooz Parhami
Open office hours: M 12:30-2:00, W 1:00-2:00, HFH 5155
Course announcements: Listed in reverse chronological order
Course calendar: Schedule of lectures, homework, exams, research
Homework assignments: Five assignments, worth a total of 25%
Exams: Closed-book midterm (25%) and final (50%)
Research paper: Report and short oral presentation (not this quarter)
Research paper guidelines: Brief guide to format and contents
Poster presentation tips: Brief guide to format and structure
Grade statistics: Range, mean, etc. for homework and exam grades
References: Textbook and other sources (Textbook's web page)
Lecture slides: Available on the textbook's web page
Miscellaneous information: Motivation, catalog entry, history

Course Announcements

2016/03/15: The winter 2016 offering of ECE 254B is officially over and course grades have been submitted. Have a great rest of the finals week and a pleasant spring break!
2016/03/10: Add Sections 15.5 and 16.5 to those excluded from the final exam.
2016/02/26: The last course homework (HW5, due on Wednesday 3/09) has been posted to the homework area of this page. Updated lecture slides for Part IV (Chapters 13-16) and Part V (Chapters 17-20) have been posted to the textbook's Web page.
2016/02/11: Homework 4 (due on Monday 2/29) has been posted to the homework area of this page and grade stats have been brought up to date.
2016/02/06: The midterm exam study guide on this page is now up to date. Also, updated lecture slides for Part III (Chapters 9-12) are available from the textbook's Web page.
2016/01/28: As announced in class on W 1/27, my Wednesday office hour has changed to 1:00-2:00. Please let me know if this change causes any inconvenience.
2016/01/26: Homework 3 and a correction to Problem 6.14 of Homework 2 have been posted to the homework area of this page. Updated lecture slides for Part II'' (Chapters 7-8C) will be posted to the textbook's Web page no later than Saturday 1/30.
2016/01/20: Homework 2, due in 12 days, has been posted to the homework area of this page. I apologize for my error in today's class, where I said that HW2 would be posted over the weekend.
2016/01/14: Updated winter 2016 lecture slides for Part II' (Chapters 5-6C) are now available from the textbook's Web page.
2016/01/08: Homework 1 has been posted to the homework area of this page. Remember that homework deadlines are strict, so please start working on the problems early. Updated versions of the slides for Part I (Chapters 1-4) are now available from the textbook's Web page.
2016/01/05: Computer architect Gene Amdahl (1922-2015): Amdahl is best known for a law he formulated in the 1960s which states that the speed-up achieved through parallel processing would be limited to 1/f at best, if the program being run has a fraction f of its operations that are inherently sequential (unparallelizable). For example, if 5% of a program's operations are inherently sequential, speed-up can never exceed 20, no matter how many processors we throw at the problem. I have recently shown in my own work that what is known as Amdahl's Law for parallel processing speed-up can also be applied to system reliability improvement (IEEE Computer magazine, issue of July 2015). Lesser known, but just as important, is Amdahl's formulation of a set of rules of thumb for system balance that establish relationships between processing speed, memory requirements, and I/O performance. Amdahl worked at IBM for many years, where he was the principal architect for the famed System 360 series of mainframe computers. Later, his company, Amdahl Corporation, manufactured plug-compatible versions of IBM mainframes and introduced many software innovations to improve the performance and reliability of computer systems. Amdahl passed away recently at age 92.
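The bound described above is easy to explore numerically. Here is a minimal Python sketch of Amdahl's law (the function name is mine, not from the textbook):

```python
def amdahl_speedup(f, p):
    """Upper bound on speedup with p processors when a fraction f
    of the work is inherently sequential (Amdahl's law)."""
    return 1.0 / (f + (1.0 - f) / p)

# With 5% sequential work (f = 0.05), the bound 1/f = 20 is approached
# but never reached, no matter how large p grows:
print(amdahl_speedup(0.05, 100))     # about 16.8
print(amdahl_speedup(0.05, 10**6))   # just under 20
```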
2015/11/20: Welcome to the ECE 254B web page for winter 2016. As of today, 16 students have signed up to take the course. The following information is provided for planning purposes only. Details will be finalized in late December and updated regularly thereafter. I will be updating and improving the on-line lecture slides as the course proceeds, so the winter 2016 contents will be different from the current version. Please pay attention to the associated posting date when downloading material for the course.

Course Calendar


Course lectures, homework assignments, exams, and research milestones have been scheduled as follows. This schedule will be strictly observed. In particular, no extension is possible for homework due dates or research milestones. Each lecture covers topics in 1-2 chapters of the textbook. Chapter numbers are provided in parentheses, after day & date. PowerPoint and PDF files of the lecture slides can be found on the textbook's web page.

Day & Date (book chapters) Lecture topic [Homework posted/due] {Special notes}
M 01/04 (1) Introduction to parallel processing
W 01/06 (2) A taste of parallel algorithms

M 01/11 (3-4) Complexity and parallel computation models [HW1 posted, chs. 1-4]
W 01/13 (5) The PRAM shared-memory model and basic algorithms

M 01/18 No lecture: Martin Luther King Holiday
W 01/20 (6A) More shared-memory algorithms [HW1 due] [HW2 posted, chs. 5-6C]

M 01/25 (6B-6C) Shared memory implementations and abstractions
W 01/27 (7) Sorting and selection networks [HW3 posted, chs. 7-8C]

M 02/01 (8A) Search acceleration circuits [HW2 due]
W 02/03 (8B-8C) Other circuit-level examples

M 02/08 (9) Sorting on a 2D mesh or torus [HW3 due]
W 02/10 (1-8) Midterm exam, 10:00-11:45 AM: closed-book; a simple calculator is permitted

M 02/15 No lecture: President's Day Holiday
W 02/17 (10) Routing on a 2D mesh or torus [HW4 posted, chs. 9-12]

M 02/22 (11) Other algorithms for mesh/torus architectures
W 02/24 (12) Mesh/torus variations and extensions

M 02/29 (13) Hypercubes and their algorithms [HW4 due] [HW5 posted, chs. 13-16]
W 03/02 (14) Sorting and routing on hypercubes

M 03/07 (15-16) Other interconnection architectures {Instructor/course evaluation surveys}
W 03/09 (17-18) Task scheduling and input/output [HW5 due]

M 03/14 (1-16) Final exam, 8:00-11:00 AM: closed-book; a simple calculator is permitted

T 03/22 {Course grades due by midnight}

Homework Assignments


-Turn in solutions in class before the lecture begins.
-Because solutions will be handed out on the due date, no extension can be granted.
-Use a cover page that includes your name, course name, and assignment number.
-Staple the sheets and write your name on top of each sheet in case they are separated.
-Although some cooperation is permitted, direct copying will have severe consequences.

Homework 1: Introduction, models, and complexity (chs. 1-4, due W 2016/01/20, 10:00 AM)
Do the following problems from the textbook or defined below: 1.7, 1.10, 1.22, 2.9, 3.6ab, 3.8
1.22   Top supercomputers in the world
Read the paper [Stro15] and answer the following questions in less than one page of typeset text (single spacing okay, if needed).
a. What is the significance of the Top500 supercomputers list?
b. In Figure 1, why is the topmost curve wavy whereas the other two are relatively smooth?
c. How many processors are used in the current top supercomputer on the Top500 list?
d. What is the current performance level of the top supercomputer on the Top500 list?
e. When do you think exaflops-level performance will be achieved?
[Stro15] Strohmaier, E., H. W. Meuer, J. Dongarra, and H. D. Simon, "The TOP500 List and Progress in High-Performance Computing," IEEE Computer, Vol. 48, No. 11, pp. 42-49, November 2015.

Homework 2: Shared memory model of parallel processing (chs. 5-6C, due M 2016/02/01, 10:00 AM)
Do the following problems from the textbook: 5.3, 5.5, 5.12, 6.13, 6.14, 16.13
Correction   The one-line introduction to Problem 6.14 should be replaced by:
Consider the butterfly memory access network depicted in Fig. 6.9, but with only eight processors and eight memory banks, one connected to each circular node in columns 0 and 3 of the butterfly.
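The corrected configuration is easy to check by enumerating the network's edges. The following Python sketch uses one common indexing convention for butterfly networks (straight and cross edges between adjacent columns); the bit order in Fig. 6.9 may differ:

```python
def butterfly_edges(q):
    """Edges of a q-dimensional butterfly with q+1 columns of 2^q nodes.
    Node (c, r) in column c (0 <= c < q) has a straight edge to (c+1, r)
    and a cross edge to (c+1, r ^ (1 << c)).  One common convention;
    the textbook figure's bit order may differ."""
    edges = []
    for c in range(q):
        for r in range(1 << q):
            edges.append(((c, r), (c + 1, r)))
            edges.append(((c, r), (c + 1, r ^ (1 << c))))
    return edges

# Eight processors (column 0) and eight memory banks (column 3): q = 3.
edges = butterfly_edges(3)
print(len(edges))  # 2 edges per node, over 3 columns of 8 nodes: 48
```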

Homework 3: Circuit model of parallel processing (chs. 7-8C, due M 2016/02/08, 10:00 AM)
Do the following problems from the textbook or defined below: 7.2, 7.7, 7.20, 8.6abc, 8.8
7.20   Delay and cost of sorting networks
Let Dmin(n) and Cmin(n) be the minimal delay and minimal cost for an n-sorter constructed only of 2-sorters (the two lower bounds may not be realizable in a single circuit).
a. Prove or disprove: Dmin(n) is a strictly increasing function of n; i.e., Dmin(n + 1) > Dmin(n).
b. Prove or disprove: Cmin(n) is a strictly increasing function of n; i.e., Cmin(n + 1) > Cmin(n).
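As background for reasoning about Dmin(n) and Cmin(n), the zero-one principle gives a quick way to verify that a candidate comparator network sorts. A minimal Python sketch (function names are mine), shown with the 4-sorter of cost 5 and delay 3:

```python
from itertools import product

def apply_network(net, vec):
    """Run a comparator network, given as (i, j) pairs with i < j,
    on a list of values; each comparator orders positions i and j."""
    v = list(vec)
    for i, j in net:
        if v[i] > v[j]:
            v[i], v[j] = v[j], v[i]
    return v

def is_sorter(net, n):
    """Zero-one principle: a comparator network on n lines sorts all
    inputs iff it sorts every one of the 2^n binary vectors."""
    return all(apply_network(net, bits) == sorted(bits)
               for bits in product([0, 1], repeat=n))

# A 4-sorter with cost 5 (comparators) and delay 3 (levels):
# level 1 = {(0,1), (2,3)}, level 2 = {(0,2), (1,3)}, level 3 = {(1,2)}
net4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]
print(is_sorter(net4, 4))  # True
```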

Homework 4: Mesh- and torus-connected computers (chs. 9-12, due M 2016/02/29, 10:00 AM)
Do the following problems from the textbook or defined below: 9.8, 9.9, 10.9, 10.14abc, 11.2, 12.24
12.24   Twin torus topology
Read the paper [Andu15] about the n-D twin torus topology and answer the following questions.
[Andu15] Andujar-Munoz, F. J., J. A. Villar-Ortiz, J. L. Sanchez, F. J. Alfaro, and J. Duato, "N-Dimensional Twin Torus Topology," IEEE Trans. Computers, Vol. 64, No. 10, pp. 2847-2861, October 2015.
a. Describe the proposed topology in no more than two sentences.
b. What is the main advantage of twin torus over ordinary torus with the same number of nodes?
c. What is a primary disadvantage of twin torus compared with ordinary torus? Explain.
d. Propose a generalization to triplet torus topology and cite its advantages and disadvantages.

Homework 5: Hypercubic and other networks (chs. 13-16, due W 2016/03/09, 10:00 AM)
Do the following problems from the textbook: 13.2abcd, 13.14, 14.10, 15.3, 16.3abc

Sample Exams and Study Guides


Updated on February 5, 2016, for winter 2016.
The following sample exam using problems from the textbook is meant to indicate the types and levels of problems, rather than the coverage (which is outlined in the course calendar). Students are responsible for all sections and topics, in the textbook and class handouts, that are not explicitly excluded in the study guide that follows the sample exam, even if the material was not covered in class lectures.

Sample Midterm Exam (105 minutes)
Textbook problems 2.3, 3.5, 5.5 (with i + s corrected to j + s), 7.6a, and 8.4ac; note that problem statements might change a bit for a closed-book exam.

Midterm Exam Study Guide
The following sections are excluded from the midterm exam's coverage of Chapters 1-8 of the textbook, including the six new chapters 6A-6C (expanding on Chapter 6) and 8A-8C (expanding on Chapter 8):
2.6, 3.5, 4.5, 4.6, 6A.6, 6B.3, 6B.5, 6C.3, 6C.4, 6C.5, 6C.6, 7.5, 7.6, 8A.5, 8A.6, 8B.2, 8B.5, 8B.6

Sample Final Exam (180 minutes)
Textbook problems 1.11, 6.14, 9.5, 10.5, 13.6, 14.10, 16.1; note that problem statements might change a bit for a closed-book exam.

Final Exam Study Guide
The following sections are excluded from the final exam's coverage of Chapters 1-16 of the textbook: all midterm exclusions, plus 9.6, 12.6, 13.5, 15.5, 16.5, 16.6.

Research Paper and Presentation [does not apply to winter 2016]


Each student will review a subfield of parallel processing or do original research on a selected and approved topic. A tentative list of research topics is provided below; however, students should feel free to propose their own topics for approval. A publishable report earns an "A" for the course, regardless of homework grades. See the course calendar for schedule and due dates and Research Paper Guidelines for formatting tips.

1. Shared Memory Consistency: Models and Implementations (Assigned to: TBD)
[Stei04] Steinke, R. C. and G. J. Nutt, "A Unified Theory of Shared Memory Consistency," J. ACM, Vol. 51, No. 5, pp. 800-849, September 2004.
[Adve10] Adve, S. V. and H.-J. Boehm, "Memory Models: A Case for Rethinking Parallel Languages and Hardware," Communications of the ACM, Vol. 53, No. 8, pp. 90-101, August 2010.

2. Area/Time/Power Trade-offs in Designing Universal Circuits (Assigned to: TBD)
[Bhat08] Bhatt, S. N., G. Bilardi, and G. Pucci, "Area-Time Tradeoffs for Universal VLSI Circuits," Theoretical Computer Science, Vol. 408, Nos. 2-3, pp. 143-150, November 2008.
[Leis85] Leiserson, C. E., "Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing," IEEE Trans. Computers, Vol. C-34, No. 10, pp. 892-901, October 1985.

3. Optimized Interconnection Networks for Parallel Processing (Assigned to: TBD)
[Gupt06] Gupta, A. K., and W. J. Dally, "Topology Optimization of Interconnection Networks," IEEE Computer Architecture Letters, Vol. 5, No. 1, pp. 10-13, January-June 2006.
[Ahon04] Ahonen, T., D. A. Siguenza-Tortosa, H. Bin, and J. Nurmi, "Topology Optimization for Application-Specific Networks-on-Chip," Proc. Int'l Workshop System-Level Interconnect Prediction, pp. 53-60, 2004.

4. Trade-offs in Low- vs High-Dimensional Meshes and Tori (Assigned to: TBD)
[Dall90] Dally, W. J., "Performance Analysis of k-ary n-cube Interconnection Networks," IEEE Trans. Computers, Vol. 39, No. 6, pp. 775-785, June 1990.
[Agar91] Agarwal, A., "Limits on Interconnection Network Performance," IEEE Trans. Parallel and Distributed Systems, Vol. 2, No. 4, pp. 398-412, October 1991.

5. Implementing Deadlock-Free Routing via Turn Prohibition (Assigned to: TBD)
[Glas94] Glass, C. J. and L. M. Ni, "The Turn Model for Adaptive Routing," J. ACM, Vol. 41, No. 5, pp. 874-902, September 1994.
[Levi10] Levitin, L., M. Karpovsky, and M. Mustafa, "Minimal Sets of Turns for Breaking Cycles in Graphs Modeling Networks," IEEE Trans. Parallel and Distributed Systems, Vol. 21, No. 9, pp. 1342-1353, September 2010.

6. Swapped and Biswapped Networks: A Comparative Study (Assigned to: TBD)
[Parh05] Parhami, B., "Swapped Interconnection Networks: Topological, Performance, and Robustness Attributes," J. Parallel and Distributed Computing, Vol. 65, No. 11, pp. 1443-1452, November 2005.
[Xiao10] Xiao, W. J., B. Parhami, W. D. Chen, M. X. He, and W. H. Wei, "Fully Symmetric Swapped Networks Based on Bipartite Cluster Connectivity," Information Processing Letters, Vol. 110, No. 6, pp. 211-215, 15 February 2010.

7. Robust Task Scheduling Algorithms for Parallel Processors (Assigned to: TBD)
[Ghos97] Ghosh, S., R. Melhem, and D. Mosse, "Fault Tolerance through Scheduling of Aperiodic Tasks in Hard Real-Time Multiprocessor Systems," IEEE Trans. Parallel and Distributed Systems, Vol. 8, No. 3, pp. 272-284, March 1997.
[Beno08] Benoit, A., M. Hakem, and Y. Robert, "Fault Tolerant Scheduling of Precedence Task Graphs on Heterogeneous Platforms," Proc. Int'l. Symp. Parallel and Distributed Processing, pp. 1-8, 2008.

8. Artificial Neural Networks as Parallel Systems and Algorithms (Assigned to: TBD)
[Take92] Takefuji, Y., Neural Network Parallel Computing, Kluwer, 1992.
[Yao99] Yao, X., "Evolving Artificial Neural Networks," Proc. IEEE, Vol. 87, No. 9, pp. 1423-1447, September 1999.

9. Adaptable Parallelism for Real-Time Performance and Reliability (Assigned to: TBD)
[Moro96] Moron, C. E., "Designing Adaptable Real-Time Fault-Tolerant Parallel Systems," Proc. Int'l Parallel Processing Symp., pp. 754-758, 1996.
[Hsiu09] Hsiung, P.-A., C.-H. Huang, and Y.-H. Chen, "Hardware Task Scheduling and Placement in Operating Systems for Dynamically Reconfigurable SoC," J. Embedded Computing, Vol. 3, No. 1, pp. 53-62, 2009.

10. Distributed System-Level Malfunction Diagnosis in Multicomputers (Assigned to: TBD)
[Soma87] Somani, A. K., V. K. Agarwal, and D. Avis, "A Generalized Theory for System Level Diagnosis," IEEE Trans. Computers, Vol. C-36, pp. 538-546, 1987.
[Pelc91] Pelc, A., "Undirected Graph Models for System-Level Fault Diagnosis," IEEE Trans. Computers, Vol. 40, No. 11, pp. 1271-1276, November 1991.

11. The MapReduce Approach to Parallel Processing (Assigned to: TBD)
[Dean10] Dean, J. and S. Ghemawat, "MapReduce: A Flexible Data Processing Tool," Communications of the ACM, Vol. 53, No. 1, pp. 72-77, January 2010.
[Ston10] Stonebraker, M., et al., "MapReduce and Parallel DBMSs: Friends or Foes?" Communications of the ACM, Vol. 53, No. 1, pp. 64-71, January 2010.

12. Transactional Memory: Concept and Implementations (Assigned to: TBD)
[Laru08] Larus, J. and C. Kozyrakis, "Transactional Memory," Communications of the ACM, Vol. 51, No. 7, pp. 80-88, July 2008.
[Dice09] Dice, D., Y. Lev, M. Moir, and D. Nussbaum, "Early Experience with a Commercial Hardware Transactional Memory Implementation," Proc. 14th Int'l Conf. Architectural Support for Programming Languages and Operating Systems, 2009, pp. 157-168.

13. The Notion of Reliability Wall in Parallel Computing (Assigned to: TBD)
[Yang12] Yang, X., Z. Wang, J. Xue, and Y. Zhou, "The Reliability Wall for Exascale Supercomputing," IEEE Trans. Computers, Vol. 61, No. 6, pp. 767-779, June 2012.
[Zhen09] Zheng, Z. and Z. Lan, "Reliability-Aware Scalability Models for High-Performance Computing," Proc. Int'l Conf. Cluster Computing, 2009, pp. 1-9.

14. FPGA-Based Implementation of Application-Specific Parallel Systems (Assigned to: TBD)
[Wang03] Wang, X. and S. G. Ziavras, "Parallel Direct Solution of Linear Equations on FPGA-Based Machines," Proc. Int'l Parallel and Distributed Processing Symp., 2003.
[Wood08] Woods, R., J. McAllister, G. Lightbody, and Y. Yi, FPGA-Based Implementation of Signal Processing Systems, Wiley, 2008.

15. Biologically-Inspired Parallel Algorithms and Architectures (Assigned to: TBD)
[Furb09] Furber, S., "Biologically-Inspired Massively-Parallel Architectures—Computing Beyond a Million Processors," Proc. 9th Int'l Conf. Application of Concurrency to System Design, 2009, pp. 3-12.
[Lewi09] Lewis, A., S. Mostaghim, and M. Randall (eds.), Biologically-Inspired Optimization Methods, Springer, 2009.

Poster Presentation Tips [does not apply to winter 2016]


Here are some guidelines for preparing your research poster. The idea of the poster is to present your research results and conclusions thus far, to get oral feedback from the instructor and your peers during the session, and to provide the instructor with something to comment on before your final report is due. Please send a PDF copy of the poster via e-mail by midnight on the poster presentation day.

Posters prepared for conferences must be colorful and eye-catching, as they typically compete with dozens of other posters for the attendees' attention. Here is an example of a conference poster. Such posters are often mounted on a colored cardboard base, even if the pages themselves are standard PowerPoint slides. In our case, you should aim for a "plain" poster (loose sheets, to be taped to the wall in our classroom) that conveys your message in a simple and direct way. Eight to ten pages, each resembling a PowerPoint slide, would be an appropriate goal. You can organize the pages into a 2 x 4 (2 columns, 4 rows), 2 x 5, or 3 x 3 array on the wall. The top two pages might contain the project title, your name, course name and number, and a very short (50-word) abstract. The final two can contain your conclusions and directions for further work (including work that does not appear in the poster but will be included in your research report). The rest should contain brief descriptions of your ideas, with emphasis on diagrams, graphs, tables, and the like, rather than text, which is very difficult for a visitor to absorb in a limited time span.

Grade Statistics


HW1 grades: Range = [55, 99], Mean = 86, Median = 87, SD = 13
HW2 grades: Range = [44, 92], Mean = 75, Median = 73, SD = 17
HW3 grades: Range = [54, 86], Mean = 69, Median = 68, SD = 11
HW4 grades: Range = [48, 92], Mean = 75, Median = 77, SD = 15
HW5 grades: Range = [64, 93], Mean = 82, Median = 87, SD = 12
Overall HW grades: Range = [59, 90], Mean = 78, Median = 76, SD = 11
Midterm exam grades: Range = [74, 92], Mean = 82, Median = 84, SD = 9
Final exam grades: Range = [60, 92], Mean = 80, Median = 80, SD = 11
All grades listed above are in percent.
Course letter grades: Range = [B, A+], Mean = 3.6, Median = 3.7, SD = 0.4


References

Required text: B. Parhami, Introduction to Parallel Processing: Algorithms and Architectures, Plenum Press, 1999. Make sure to visit the textbook's web page, which contains errata. Lecture slides are also available there.
Optional recommended book: Gropp, W., E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, 3rd ed., MIT Press, 2014. Because the focus of our course is on architecture and its interplay with algorithms, this book, which deals primarily with software and programming topics, constitutes helpful supplementary reading.
Research resources:
The following journals contain a wealth of information on new developments in parallel processing: IEEE Trans. Parallel and Distributed Systems, IEEE Trans. Computers, J. Parallel & Distributed Computing, Parallel Computing, Parallel Processing Letters, ACM Trans. Parallel Computing. Also, see IEEE Computer and IEEE Concurrency (the latter ceased publication in late 2000) for broad introductory articles.
The following are the main conferences of the field: Int'l Symp. Computer Architecture (ISCA, since 1973), Int'l Conf. Parallel Processing (ICPP, since 1972), Int'l Parallel & Distributed Processing Symp. (IPDPS, formed in 1998 by merging IPPS/SPDP, which were held since 1987/1989), and ACM Symp. Parallel Algorithms and Architectures (SPAA, since 1988).
UCSB library's electronic journals, collections, and other resources

Miscellaneous Information

Motivation: The ultimate efficiency in parallel systems is to achieve a computation speedup factor of p with p processors. Although this ideal often cannot be achieved, some speedup is generally possible by using multiple processors in a concurrent (parallel or distributed) system. The actual speed gain depends on the system's architecture and the algorithm run on it. This course focuses on the interplay of architectural and algorithmic speedup techniques. More specifically, it deals with the problem of algorithm design for "general-purpose" parallel systems and its "converse": the incorporation of architectural features to help improve algorithm efficiency and, in the extreme, the design of algorithm-based special-purpose parallel architectures. The foregoing notions will be covered in sufficient detail to allow extensions and applications in a variety of contexts, from network processors, through desktop computers, game boxes, Web server farms, multiterabyte storage systems, and mainframes, to high-end supercomputers.

Catalog entry: 254B. Advanced Computer Architecture: Parallel Processing (4) PARHAMI. Prerequisites: ECE 254A. Lecture, 4 hours. The nature of concurrent computations. Idealized models of parallel systems. Practical realization of concurrency. Interconnection networks. Building-block parallel algorithms. Algorithm design, optimality, and efficiency. Mapping and scheduling of computations. Example multiprocessors and multicomputers.

History: The graduate course ECE 254B was created by Dr. Parhami, shortly after he joined UCSB in 1988. It was first taught in spring 1989 as ECE 594L, Special Topics in Computer Architecture: Parallel and Distributed Computations. A year later, it was converted to ECE 251, a regular graduate course. In 1991, Dr. Parhami led an effort to restructure and update UCSB's graduate course offerings in the area of computer architecture. The result was the creation of the three-course sequence ECE 254A/B/C to replace ECE 250 (Adv. Computer Architecture) and ECE 251. The three new courses were designed to cover high-performance uniprocessing, parallel computing, and distributed computer systems, respectively. In 1999, based on a decade of experience in teaching ECE 254B, Dr. Parhami published the textbook Introduction to Parallel Processing: Algorithms and Architectures (Website).
Offering of ECE 254B in winter 2016 (PDF file)
Offering of ECE 254B in winter 2014 (PDF file)
Offering of ECE 254B in winter 2013 (PDF file)
Offering of ECE 254B in fall 2010 (PDF file)
Offering of ECE 254B in fall 2008 (PDF file)
Offerings of ECE 254B from 2000 to 2006 (PDF file)