Textbook on Parallel Processing

      


Behrooz Parhami: 2007/06/19 ||  E-mail: parhami@ece.ucsb.edu  ||  Problems: webadmin@ece.ucsb.edu

Other contact info at: Bottom of this page  ||  Go up to: B. Parhami's teaching and textbooks or his home page

 

On June 19, 2007, Professor Parhami's UCSB ECE website moved to a new location. For an up-to-date version of this page, visit it at the new address: http://www.ece.ucsb.edu/~parhami/text_par_proc.htm

      

B. Parhami, Introduction to Parallel Processing: Algorithms and Architectures, Plenum Press, New York, 1999, 532 + xxi pp., ISBN 0-306-45970-1. Available for purchase from Kluwer Academic Publishers and various college or on-line bookstores.



Preface

The context of parallel processing

The field of digital computer architecture has grown explosively in the past two decades. Through a steady stream of experimental research, tool-building efforts, and theoretical studies, the design of an instruction-set architecture, once considered an art, has been transformed into one of the most quantitative branches of computer technology. At the same time, better understanding of various forms of concurrency, from standard pipelining to massive parallelism, and the invention of architectural structures to support a reasonably efficient and user-friendly programming model for such systems, have allowed hardware performance to continue its exponential growth. This trend is expected to continue in the near future.

This explosive growth, linked with the expectation that performance will continue its exponential rise with each new generation of hardware and that (in stark contrast to software) computer hardware will function correctly as soon as it comes off the assembly line, has its downside. It has led to unprecedented hardware complexity and almost intolerable development costs. The challenge facing current and future computer designers is to institute simplicity where we now have complexity; to use fundamental theories being developed in this area to gain performance and ease-of-use benefits from simpler circuits; and to understand the interplay between technological capabilities and limitations, on the one hand, and design decisions based on user and application requirements, on the other.

In the computer designers’ quest for user-friendliness, compactness, simplicity, high performance, low cost, and low power, parallel processing plays a key role. High-performance uniprocessors are becoming increasingly complex, expensive, and power-hungry. A basic tradeoff thus exists between the use of one or a small number of such complex processors, at one extreme, and a moderate to very large number of simpler processors, at the other. When combined with a high-bandwidth, but logically simple, inter-processor communication facility, the latter approach leads to significant simplification of the design process. However, two major roadblocks have thus far prevented the widespread adoption of such moderately to massively parallel architectures: the inter-processor communication bottleneck and the difficulty, and thus high cost, of algorithm/software development.

The above context is changing due to several factors. First, at very high clock rates, the link between the processor and memory becomes very critical. CPUs can no longer be designed and verified in isolation. Rather, an integrated processor/memory design optimization is required, which makes the development even more complex and costly. VLSI technology now allows us to put more transistors on a chip than required by even the most advanced superscalar processor. The bulk of these transistors are now being used to provide additional on-chip memory. However, they can just as easily be used to build multiple processors on a single chip. The emergence of multiple-processor microchips, along with currently available methods for glueless combination of several chips into larger systems and maturing standards for parallel machine models, holds the promise of making parallel processing more practical.

This is why parallel processing occupies such a prominent place in computer architecture education and research. New parallel architectures appear with amazing regularity in technical publications, while older architectures are studied and analyzed in novel and insightful ways. The wealth of published theoretical and practical results on parallel architectures and algorithms is truly awe-inspiring. The emergence of standard programming and communication models has removed some of the concerns with compatibility and software design issues in parallel processing, thus resulting in new designs and products with mass-market appeal. Given the computation-intensive nature of many application areas (such as encryption, physical modeling, and multimedia), parallel processing will continue to thrive for years to come.

Perhaps, as parallel processing matures further, it will start to become invisible. Packing many processors in a computer might constitute as much a part of a future computer architect’s toolbox as pipelining, cache memories, and multiple instruction issue do today. In this scenario, even though the multiplicity of processors will not affect the end user or even the professional programmer (other than of course boosting the system performance), the number might be mentioned in sales literature to lure customers, in the same way that clock frequency and cache size are now used. The challenge will then shift from making parallel processing work to incorporating a larger number of processors, more economically and in a truly seamless fashion.

The goals and structure of this book

The field of parallel processing has matured to the point that scores of texts and reference books have been published. Some of these books that cover parallel processing in general (as opposed to some special aspects of the field or advanced/unconventional parallel systems) are listed at the end of this preface. Each of these books has its unique strengths and has contributed to the formation and fruition of the field. The current text, Introduction to Parallel Processing: Algorithms and Architectures, is an outgrowth of lecture notes that the author has developed and refined over many years, beginning in the mid-1980s. Here are the most important features of this text in comparison to the listed books:

a.     Division of material into lecture-size chapters: In my approach to teaching, a lecture is a more or less self-contained module with links to past lectures and pointers to what will transpire in the future. Each lecture must have a theme or title and must proceed from motivation, to details, to conclusion. There must be smooth transitions between lectures and a clear enunciation of how each lecture fits into the overall plan. In designing the text, I have strived to divide the material into chapters, each of which is suitable for one lecture (1-2 hours). A short lecture can cover the first few subsections, while a longer lecture might deal with more advanced material near the end. To make the structure hierarchical, as opposed to flat or linear, chapters have been grouped into six parts, each composed of four closely related chapters (see the diagram below).

b.     A large number of meaningful problems: At least 13 problems have been provided at the end of each of the 24 chapters. These are well thought out problems, many of them class-tested, that complement the material in the chapter, introduce new viewing angles, and link the chapter material to topics in other chapters.

c.     Emphasis on both the underlying theory and practical designs: The ability to cope with complexity requires both a deep knowledge of the theoretical underpinnings of parallel processing and examples of designs that help us understand the theory. Such designs also provide hints/ideas for synthesis as well as reference points for cost-performance comparisons. This viewpoint is reflected, e.g., in the coverage of problem-driven parallel machine designs (Chapter 8) that point to the origins of the butterfly and binary-tree architectures. Other examples are found in Chapter 16 where a variety of composite and hierarchical architectures are discussed and some fundamental cost/performance tradeoffs in network design are exposed. Fifteen carefully chosen case studies in Chapters 21-23 provide additional insight and motivation for the theories discussed.

d.     Linking parallel computing to other subfields of computer design: Parallel computing is nourished by, and in turn feeds, other subfields of computer architecture and technology. Examples of such links abound. In computer arithmetic, the design of high-speed adders and multipliers contributes to, and borrows many methods from, parallel processing. Some of the earliest parallel systems were designed by researchers in the field of fault-tolerant computing in order to allow independent multi-channel computations and/or dynamic replacement of failed subsystems. These links are pointed out throughout the book.

e.     Wide coverage of important topics: The current text covers virtually all important architectural and algorithmic topics in parallel processing, thus offering a balanced and complete view of the field. Coverage of the circuit model and problem-driven parallel machines (Chapters 7-8), some variants of mesh architectures (Chapter 12), composite and hierarchical systems (Chapter 16), which are becoming increasingly important for overcoming VLSI layout and packaging constraints, and the topics in Part V (Chapters 17-20) do not all appear in other textbooks. Similarly, other books that cover the foundations of parallel processing do not contain discussions on practical implementation issues and case studies of the type found in Part VI.

f.      Unified and consistent notation/terminology throughout the text: I have tried very hard to use consistent notation/terminology throughout the text. For example, n always stands for the number of data elements (problem size) and p for the number of processors. While other authors have done this in the basic parts of their texts, there is a tendency to cover more advanced research topics by simply borrowing the notation and terminology from the reference source. Such an approach has the advantage of making the transition between reading the text and the original reference source easier, but it is utterly confusing to the majority of the students who rely on the text and do not consult the original references except, perhaps, to write a research paper.

Summary of topics

The 6 parts of this book, each composed of 4 chapters, have been written with the following goals:

Part I sets the stage, gives a taste of what is to come, and provides the needed perspective, taxonomy, and analysis tools for the rest of the book.

Part II delimits the models of parallel processing from above (the abstract PRAM model) and from below (the concrete circuit model), preparing the reader for everything else that falls in the middle.

Part III presents the scalable, and conceptually simple, mesh model of parallel processing, which has become quite important in recent years, and also covers some of its derivatives.

Part IV covers low-diameter parallel architectures and their algorithms, including the hypercube, hypercube derivatives, and a host of other interesting interconnection topologies.

Part V includes broad (architecture-independent) topics that are relevant to a wide range of systems and form the stepping stones to effective and reliable parallel processing.

Part VI deals with implementation aspects and properties of various classes of parallel processors, presenting many case studies and projecting a view of the past and future of the field.

Pointers on how to use the book

For classroom use, the topics in each chapter of this text can be covered in a lecture lasting 1-2 hours. In my own teaching, I have used the chapters primarily for 1.5-hour lectures, twice a week, in a 10-week quarter, omitting or combining some chapters to fit the material into 18-20 lectures. But the modular structure of the text lends itself to other lecture formats, self-study, or review of the field by practitioners. In the latter two cases, readers can view each chapter as a study unit (for one week, say) rather than as a lecture. Ideally, all topics in each chapter should be covered before moving to the next chapter. However, if fewer lecture hours are available, some of the subsections located at the end of chapters can be omitted or introduced only in terms of motivations and key results.

Problems of varying complexities, from straightforward numerical examples or exercises to more demanding studies or mini-projects, have been supplied for each chapter. These problems form an integral part of the book and have not been added as afterthoughts to make the book more attractive for use as a text. A total of 358 problems are included (13-16 per chapter). Assuming that  two lectures are given per week, either weekly or bi-weekly homework can be assigned, with each assignment having the specific coverage of the respective half-part (2 chapters) or full-part (4 chapters) as its “title”. In this format, the half-parts, shown in the diagram below, provide a focus for the weekly lecture and/or homework schedule.

An instructor’s manual, with problem solutions and enlarged versions of the diagrams and tables, suitable for reproduction as transparencies, is planned. The author’s detailed syllabus for the course ECE 254B at UCSB is available from the author’s Web page (look under “Teaching and Textbooks”).

References to important or state-of-the-art research contributions and designs are provided at the end of each chapter. These references provide good starting points for doing in-depth studies or for preparing term papers/projects. 

New ideas in the field of parallel processing appear in papers presented at several annual conferences, known as FMPC, ICPP, IPPS, SPAA, SPDP (now merged with IPPS), and in archival journals such as IEEE Transactions on Computers [TCom], IEEE Transactions on Parallel and Distributed Systems [TPDS], Journal of Parallel and Distributed Computing [JPDC], Parallel Computing [ParC], and Parallel Processing Letters [PPL]. Tutorial and survey papers of wide scope appear in IEEE Concurrency [Conc] and, occasionally, in IEEE Computer [Comp]. The articles in IEEE Computer provide excellent starting points for research projects and term papers.

Acknowledgments

The current text, Introduction to Parallel Processing: Algorithms and Architectures, is an outgrowth of lecture notes that the author has used for the graduate course “ECE 254B: Advanced Computer Architecture -- Parallel Processing” at the University of California, Santa Barbara, and, in rudimentary forms, at several other institutions prior to 1988. The text has benefited greatly from the keen observations, curiosity, and encouragement of my many students in these courses. A sincere thanks to all of them! Particular thanks go to Dr. Ding-Ming Kwai, who read an early version of the manuscript carefully and suggested numerous corrections and improvements.

General references

[Akl89]         Akl, S.G., The Design and Analysis of Parallel Algorithms, Prentice Hall, 1989.

[Akl97]         Akl, S.G., Parallel Computation: Models and Methods, Prentice Hall, 1997.

[Alma94]      Almasi, G.S. and A. Gottlieb, Highly Parallel Computing, Benjamin/Cummings, 2nd Ed., 1994.

[Bert89]        Bertsekas, D.P. and J.N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, Prentice-Hall, 1989.

[Code93]      Codenotti, B. and M. Leoncini, Introduction to Parallel Processing, Addison-Wesley, 1993.

[Comp]         IEEE Computer, Journal published by IEEE Computer Society; has occasional special issues on parallel/distributed processing (Feb. 1982, June 1985, Aug. 1986, June 1987, Mar. 1988, Aug. 1991, Feb. 1992, Nov. 1994, Nov. 1995, Dec. 1996).

[Conc]          IEEE Concurrency, formerly IEEE Parallel & Distributed Technology, Magazine published by IEEE Computer Society.

[Cric88]        Crichlow, J.M., Introduction to Distributed and Parallel Computing, Prentice-Hall, 1988.

[DeCe89]     DeCegama, A.L., Parallel Processing Architectures and VLSI Hardware, Prentice Hall, 1989.

[Desr87]       Desrochers, G.R., Principles of Parallel and Multiprocessing, McGraw-Hill, 1987.

[Duat97]      Duato, J., S. Yalamanchili, and L. Ni, Interconnection Networks: An Engineering Approach, IEEE Computer Society Press, 1997.

[Flyn95]       Flynn, M.J., Computer Architecture: Pipelined and Parallel Processor Design, Jones and Bartlett, 1995.

[FMPC]        Proc. Symp. Frontiers of Massively Parallel Computation, Sponsored by IEEE Computer Society and NASA. Held every 1.5-2 years since 1986. The 6th FMPC was held in Annapolis, MD, Oct. 27-31, 1996, and the 7th is planned for ...

[Foun94]      Fountain, T.J., Parallel Computing: Principles and Practice, Cambridge Univ. Press, 1994.

[Hock81]      Hockney, R.W. and C.R. Jesshope, Parallel Computers, Adam Hilger, 1981.

[Hord90]      Hord, R.M., Parallel Supercomputing in SIMD Architectures, CRC Press, 1990.

[Hord93]      Hord, R.M., Parallel Supercomputing in MIMD Architectures, CRC Press, 1993.

[Hwan84]     Hwang, K. and F.A. Briggs, Computer Architecture and Parallel Processing, McGraw-Hill, 1984.

[Hwan93]     Hwang, K., Advanced Computer Architecture: Parallelism, Scalability, Programmability, McGraw-Hill, 1993.

[Hwan98]     Hwang, K. and Z. Xu, Scalable Parallel Computing: Technology, Architecture, Programming, McGraw-Hill, 1998.

[ICPP]          Proc. Int’l Conference Parallel Processing, Sponsored by The Ohio State Univ. (and in recent years, also by The Int’l Association for Computers and Communications). Held annually since 1972. The 27th ICPP was held in Minneapolis, MN, Aug. 10-15, 1998, and the 28th is scheduled for Sep. 21-24, 1999, in Aizu, Japan.

[IPPS]           Proc. Int’l Parallel Processing Symp., Sponsored by IEEE Computer Society. Held annually since 1987. The 11th IPPS was held in Geneva, Switzerland, Apr. 1-5, 1997. Beginning with the 1998 symposium in Orlando, FL, Mar. 30 - Apr. 3, IPPS was merged with SPDP. The next joint IPPS/SPDP is scheduled for Apr. 12-16, 1999, in San Juan, Puerto Rico.

[JaJa92]        JaJa, J., An Introduction to Parallel Algorithms, Addison-Wesley, 1992.

[JPDC]         Journal of Parallel & Distributed Computing, published by Academic Press.

[Kris89]        Krishnamurthy, E.V., Parallel Processing: Principles and Practice, Addison-Wesley, 1989.

[Kuma94]     Kumar, V., A. Grama, A. Gupta, and G. Karypis, Introduction to Parallel Computing: Design and Analysis of Algorithms, Benjamin/Cummings, 1994.

[Laks90]       Lakshmivarahan, S. and S.K. Dhall, Analysis and Design of Parallel Algorithms: Arithmetic and Matrix Problems, McGraw-Hill, 1990.

[Leig92]       Leighton, F.T., Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes, Morgan Kaufmann, 1992.

[Lerm94]      Lerman, G. and L. Rudolph, Parallel Evolution of Parallel Processors, Plenum, 1994.

[Lipo87]       Lipovski, G.J. and M. Malek, Parallel Computing: Theory and Comparisons, Wiley, 1987.

[Mold93]     Moldovan, D.I., Parallel Processing: From Applications to Systems, Morgan Kaufmann, 1993.

[ParC]           Parallel Computing, Journal published by North-Holland.

[PPL]            Parallel Processing Letters, Journal published by World Scientific.

[Quin87]      Quinn, M.J., Designing Efficient Algorithms for Parallel Computers, McGraw-Hill, 1987.

[Quin94]      Quinn, M.J., Parallel Computing: Theory and Practice, McGraw-Hill, 1994.

[Reif93]        Reif, J.H. (Ed.), Synthesis of Parallel Algorithms, Morgan Kaufmann, 1993.

[Sanz89]       Sanz, J.L.C. (Ed.), Opportunities and Constraints of Parallel Computing (IBM/NSF Workshop, San Jose, CA, Dec. 1988), Springer-Verlag, 1989.

[Shar87]       Sharp, J.A., An Introduction to Distributed and Parallel Processing, Blackwell Scientific Publications, 1987.

[Sieg85]       Siegel, H.J., Interconnection Networks for Large-Scale Parallel Processing, Lexington Books, 1985.

[SPAA]        Proc. Symp. Parallel Algorithms and Architectures, Sponsored by the Association for Computing Machinery (ACM). Held annually since 1989. The 9th SPAA was held in Newport, RI, June 22-25, 1997, and the 10th is planned ...

[SPDP]         Proc. Int’l Symp. Parallel and Distributed Processing, Sponsored by IEEE Computer Society. Held annually since 1989, except for 1997. The 8th SPDP was held in New Orleans, LA, Oct. 23-26, 1996. Beginning with the 1998 symposium in Orlando, FL, Mar. 30 - Apr. 3, SPDP was merged with IPPS.

[Ston93]       Stone, H.S., High-Performance Computer Architecture, Addison-Wesley, 1993.

[TCom]        IEEE Trans. Computers, Journal published by IEEE Computer Society; has occasional special issues on parallel and distributed processing (Apr. 1987, Dec. 1988, Aug. 1989, Dec. 1991, Apr. 1997, Apr. 1998).

[TPDS]         IEEE Trans. Parallel & Distributed Systems, Journal published by IEEE Computer Society.

[Varm94]      Varma, A. and C.S. Raghavendra, Interconnection Networks for Multiprocessors and Multicomputers: Theory and Practice, IEEE Computer Society Press, 1994.

[Zoma96]     Zomaya, A.Y. (Ed.), Parallel and Distributed Computing Handbook, McGraw-Hill, 1996.



Contents at a Glance

      

      

The structure of this book in parts, half-parts, and chapters.

      

      



Complete Table of Contents

Note: Each chapter ends with problems and references.

Part I   Fundamental Concepts    

1   Introduction to Parallelism  

1.1 Why parallel processing?
1.2 A Motivating Example
1.3 Parallel processing ups and downs
1.4 Types of parallelism: A taxonomy
1.5 Roadblocks to parallel processing
1.6 Effectiveness of parallel processing

2   A Taste of Parallel Algorithms

2.1 Some simple computations
2.2 Some simple architectures
2.3 Algorithms for a linear array
2.4 Algorithms for a binary tree
2.5 Algorithms for a 2D mesh
2.6 Algorithms with shared variables

3   Parallel Algorithm Complexity

3.1 Asymptotic complexity
3.2 Algorithm optimality and efficiency
3.3 Complexity classes
3.4 Parallelizable tasks and the NC class
3.5 Parallel programming paradigms
3.6 Solving recurrences

4   Models of Parallel Processing

4.1 Development of early models
4.2 SIMD versus MIMD architectures
4.3 Global versus distributed memory
4.4 The PRAM shared-memory model
4.5 Distributed-memory or graph models
4.6 Circuit model and physical realizations

Part II   Extreme Models

5   PRAM and Basic Algorithms

5.1 PRAM submodels and assumptions
5.2 Data broadcasting
5.3 Semigroup or fan-in computation
5.4 Parallel prefix computation
5.5 Ranking the elements of a linked list
5.6 Matrix multiplication

6   More Shared-Memory Algorithms

6.1 Sequential rank-based selection
6.2 A parallel selection algorithm
6.3 A selection-based sorting algorithm
6.4 Alternative sorting algorithms
6.5 Convex hull of a 2-D point set
6.6 Some implementation aspects

7   Sorting and Selection Networks

7.1 What is a sorting network?
7.2 Figures of merit for sorting networks
7.3 Design of sorting networks
7.4 Batcher sorting networks
7.5 Other classes of sorting networks
7.6 Selection networks

8   Other Circuit-Level Examples

8.1 Searching and dictionary operations
8.2 A tree-structured dictionary machine
8.3 Parallel prefix computation
8.4 Parallel prefix networks
8.5 The discrete Fourier transform
8.6 Parallel architectures for FFT

Part III   Mesh-Based Architectures

9   Sorting on a 2-D Mesh or Torus

9.1 Mesh-connected computers
9.2 The shearsort algorithm
9.3 Variants of simple shearsort
9.4 Recursive sorting algorithms
9.5 A non-trivial lower bound
9.6 Achieving the lower bound

10   Routing on a 2-D Mesh or Torus

10.1 Types of data routing
10.2 Useful elementary operations
10.3 Data routing on a 2-D array
10.4 Greedy routing algorithms
10.5 Other classes of routing algorithms
10.6 Wormhole routing

11   Numerical 2-D Mesh Algorithms

11.1 Matrix multiplication
11.2 Triangular system of equations
11.3 Tridiagonal system
11.4 Arbitrary system of linear equations
11.5 Graph algorithms
11.6 Image-processing algorithms

12   Other Mesh-Related Architectures

12.1 Three or more dimensions
12.2 Stronger and weaker connectivities
12.3 Meshes augmented with non-local links
12.4 Meshes with dynamic links
12.5 Pyramid and multi-grid systems
12.6 Meshes of trees

Part IV Low-Diameter Architectures

13   Hypercubes and Their Algorithms

13.1 Definition and main properties
13.2 Embeddings and their usefulness
13.3 Embedding of arrays and trees
13.4 A few simple algorithms
13.5 Matrix multiplication
13.6 Inverting a lower triangular matrix

14   Sorting and Routing on Hypercubes

14.1 Defining the sorting problem
14.2 Bitonic sorting
14.3 Routing problems on a hypercube
14.4 Dimension-order routing
14.5 Broadcasting on a hypercube
14.6 Adaptive and fault-tolerant routing

15    Other Hypercubic Architectures

15.1 Modified and generalized hypercubes
15.2 Butterfly and permutation networks
15.3 Plus-or-minus-2^i network
15.4 The cube-connected cycles
15.5 Shuffle and shuffle-exchange networks
15.6 That's not all, folks!

16   A Sampler of Other Networks

16.1 Performance parameters for networks
16.2 Star and pancake networks
16.3 Ring-based networks
16.4 Composite or hybrid networks
16.5 Hierarchical (multi-level) networks
16.6 Multi-stage interconnection networks

Part V  Some Broad Topics

17   Emulation and Scheduling

17.1 Emulations among architectures
17.2 Distributed shared memory
17.3 The task scheduling problem
17.4 A class of scheduling algorithms
17.5 Some useful bounds for scheduling
17.6 Load balancing and dataflow systems

18   Data Storage, Input, and Output

18.1 Data access problems and caching
18.2 Cache coherence protocols
18.3 Multi-threading and latency hiding
18.4 Parallel I/O technology
18.5 Redundant disk arrays
18.6 Interfaces and standards

19   Reliable Parallel Processing

19.1 Defects, faults, ... , failures
19.2 Defect-level methods
19.3 Fault-level methods
19.4 Error-level methods
19.5 Malfunction-level methods
19.6 Degradation-level methods

20 System and Software Issues

20.1 Coordination and synchronization
20.2 Parallel programming
20.3 Software portability and standards
20.4 Parallel operating systems
20.5 Parallel file systems
20.6 Hardware/software interaction

Part VI   Implementation Aspects

21   Shared-Memory MIMD Machines

21.1 Variations in shared memory
21.2 MIN-based BBN Butterfly
21.3 Vector-parallel Cray Y-MP
21.4 Latency-tolerant Tera MTA
21.5 CC-NUMA Stanford DASH
21.6 SCI-based Sequent NUMA-Q

22 Message-Passing MIMD Machines

22.1 Mechanisms for message passing
22.2 Reliable bus-based Tandem NonStop
22.3 Hypercube-based nCUBE3
22.4 Fat-tree-based Connection Machine 5
22.5 Omega-network-based IBM SP2
22.6 Commodity-driven Berkeley NOW

23   Data-Parallel SIMD Machines

23.1 Where have all the SIMDs gone?
23.2 The first supercomputer: ILLIAC IV
23.3 Massively parallel Goodyear MPP
23.4 Distributed Array Processor (DAP)
23.5 Hypercubic Connection Machine 2
23.6 Multi-connected MasPar MP-2

24   Past, Present, and Future

24.1 Milestones in parallel processing
24.2 Current status, issues, and debates
24.3 TFLOPS, PFLOPS, and beyond
24.4 Processor and memory technologies
24.5 Interconnection technologies
24.6 The future of parallel processing

Index



From the Preface to the Instructor’s Manuals

This instructor’s manual consists of two volumes. Volume 1 presents solutions to selected problems and includes additional problems (many with solutions) that did not make the cut for inclusion in the text Introduction to Parallel Processing: Algorithms and Architectures (Plenum Press, 1999) or that were designed after the book went to print. Volume 2 contains enlarged versions of the figures and tables in the text as well as additional material, presented in a format that is suitable for use as transparency masters.

The winter 2002 edition of Volume 1, which consists of the following parts, is available to qualified instructors through the publisher:

Volume 1          Part I                Selected solutions and additional problems

                                          Part II              Question bank, assignments, and projects

The winter 2002 edition of Volume 2, which consists of the following parts, is available as a large downloadable file through the book’s Web page (see Presentations below):

Volume 2          Parts I-VI        Lecture slides and other presentation material

The book’s Web page, given below, also contains an errata list and a host of other material (please note the upper-case “F” and “P” and the underscore symbol after “text” and “par”):

http://www.ece.ucsb.edu/Faculty/Parhami/text_par_proc.htm

The author would appreciate the reporting of any error in the textbook or in this manual, suggestions for additional problems, alternate solutions to solved problems, solutions to other problems, and sharing of teaching experiences. Please e-mail your comments to

            parhami@ece.ucsb.edu

or send them by regular mail to the author’s postal address:

            Department of Electrical and Computer Engineering

            University of California

            Santa Barbara, CA 93106-9560, USA

Contributions will be acknowledged to the extent possible.



Presentations

The following PowerPoint and pdf presentation files (prepared in 2005 and last updated in spring 2006, 1-2 MB each) supersede earlier versions of the Instructor's Manual, Vol. 2:

  • Presentation for Part I: Fundamental Concepts (pdf, ppt)

  • Presentation for Part II: Extreme Models (pdf, ppt)

  • Presentation for Part III: Mesh-Based Architectures (pdf, ppt)

  • Presentation for Part IV: Low-Diameter Architectures (pdf, ppt)

  • Presentation for Part V: Some Broad Topics (pdf, ppt)

  • Presentation for Part VI: Implementation Aspects (pdf, ppt)



Errors and Additions

The following additions and corrections are presented in the order of page numbers in the book. Each entry is assigned a unique code, which appears in square brackets and begins with the date of posting or last modification in the format "yymmdd". As errors are removed in future printings, the codes in square brackets will be augmented to include the number of the printing that no longer contains the error.

Note: To find out which printing of the book you own, look at the bottom of the copyright page; there, you see a sequence of numbers "10  9  8  . . .", the last of which is the printing.

p. 10            [010410c] In Fig. 1.5, lower right corner, the label "(b)" is redundant and should be removed.

p. 13            [010410d] In the right panel of Fig. 1.9, replace "Real" with "Actual".

p. 21            [030220] In Problem 1.1, part a, replace “MIPS rating” by “IPS rating”.

p. 21            [001212] In Problem 1.5, replace “For each of the values, 1, 10, and 100 for the parameter b, determine the range of values for c where . . .” by “Determine the range of values for b and c where . . .”.

p. 23            [001212a] Add [Bokh87] to the reference list for Chapter 1.

[Bokh87] Bokhari, S., “Multiprocessing the Sieve of Eratosthenes,” IEEE Computer, Vol. 20, No. 4, pp. 50-58, April 1987.

p. 50            [001212b] Table 3.3, leftmost column, last entry: align “1” with the numbers above it.

p. 50            [001212c] Table 3.3, column headed 100 sqrt(n): the first entry should be 5 min (not .5 min).

p. 50            [010410a] Table 3.3, column headed 100 sqrt(n): the next to the last entry should be 9 hr (not 1 yr).

p. 50            [010410b] Table 3.3, column headed 100 sqrt(n): the last entry should be 1 day (not 3 yr).

p. 55            [001212d] Evidence for the suspicion that NC ≠ P falls into three categories [Gree95, pp. 61-70]:

a.        The best general simulations of sequential models yield modest speed-ups of log T or sqrt(T), where T is the sequential time. Furthermore, these simulations need 2^(T^Ω(1)), or exponentially many, processors. A fast general simulation would allow us to run sequential programs in parallel with good speed-up. Such a “super” parallelizing compiler is impractical.

b.       All known fast simulations (parallelizations) have very limited domains of applicability. For example, we know that certain machines with limited resources (e.g., time and space) can recognize languages that are in NC but that the resources must increase exponentially to cover all languages in P. Hence, NC = P would mean that this exponential increase in resources does not add to the machine’s computing power. This is quite unlikely.

c.        Certain natural approaches to parallelization have been proven to be insufficient.

For a list of P-complete problems, see [Gree95, pp. 119-220]. Open problems that are not known to be P-complete or in NC are listed in [Gree95, pp. 221-243].

[Gree95] Greenlaw, R., H.J. Hoover, and W.L. Ruzzo, Limits to Parallel Computation: P-Completeness Theory, Oxford Univ. Press, 1995.

p. 59            [010330] Near the bottom of the page, the solution given for g(n) = g(n - 1) + 1/2 is incorrect; the correct solution, (n - 1)/2, makes the answer different from that obtained from unrolling. The problem stems from the fact that the assumptions f(1) = 0 and g(1) = 0 are incompatible, given the definition f(n) = an^2 + g(n). The simplest fix is to replace "assuming g(1) = 0" on the third line from the bottom of the page with "assuming g(2) = 0". An alternate fix is to write f(n) = an^2 + g(n) - a on line 9 from the bottom, arguing that this form makes f(1) = g(1) = 0 possible.
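
                    As a quick check of the arithmetic behind this correction (a sketch added for the reader, not text from the book), unrolling the recurrence gives

                    $$g(n) = g(n-1) + \tfrac{1}{2} = g(k) + \tfrac{n-k}{2} \quad \text{for any base index } k,$$

                    so $g(1) = 0$ yields $g(n) = (n-1)/2$, while the definition $f(n) = an^2 + g(n)$ forces $f(1) = a + g(1)$; thus $f(1) = 0$ and $g(1) = 0$ cannot both hold unless $a = 0$, which is why one of the two fixes above is needed.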

p. 79            [020104] Before last paragraph add: Both LogP and BSP model the network bandwidth limitation on a per-processor basis. It is also possible to impose a global bandwidth limit (of, say, p/g). This can be shown to lead to a fundamentally stronger model [Adle99] that should not be used if the hardware in fact imposes a per-processor limit.

                    [Adle99] Adler, M., P.B. Gibbons, Y. Matias, and V. Ramachandran, “Modeling Parallel Bandwidth: Local versus Global Restrictions,” Algorithmica, Vol. 24, pp. 381-404, 1999.

p. 97            [030220a] On lines 2 and 3 from bottom, replace “having more processors than items” by “having fewer processors than items”.

p. 103          [001212f] Line 3: Change “m^2 processors” to “m^2 processors” (italicize m).

p. 124          [010410] Lines 11-12 are to be slightly modified as follows: This is a butterfly network that we will encounter again in Chapter 8, where we devise a circuit for computing the fast Fourier transform (FFT), and in Chapter 15, (insert two commas and remove "again")

p. 125          [001212g] At the end of Section 6.6, add: For a good discussion of hot-spot contention, see [Dand99].

                    [Dand99] Dandamudi, S.P., “Reducing Hot-Spot Contention in Shared-Memory Multiprocessor Systems,” IEEE Concurrency, Vol. 7, No. 1, pp. 48-59, January-March 1999.  

p. 127          [020123] Replace the one-line introduction of Problem 6.14 with the following: Consider the butterfly memory access network depicted in Fig. 6.9, but with only eight processors and eight memory banks, one connected to each circular node in columns 0 and 3 of the butterfly.

p. 135          [010501] On line 7 in the first paragraph of Section 7.3, n floor(n/2) should be n(n - 1)/2.
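
                    As a side note (an assumption about context, since the erratum itself does not say what the expression counts): if the corrected formula gives the number of comparators in a sorting network derived directly from insertion sort or selection sort, it is simply the triangular sum

                    $$1 + 2 + \cdots + (n-1) = \binom{n}{2} = \frac{n(n-1)}{2}.$$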

p. 138          [010524] Lines 6-7: The 0s and 1s on line 6, and the w_i terms right below them, should be vertically aligned to appear halfway between the 0s and 1s on line 5.

p. 138          [010524a] Last line: The rightmost symbol in the equation for the cost C(n) of Batcher's even-odd-merge sorter should be 4, not 2.

p. 141          [010919] Becker, Nassimi, and Perl [Beck98] provide a generalization of periodic sorting networks of Dowd et al [Dowd89] that leads to a large class of similarly structured networks with identical latencies. The networks of Dowd et al represent one extreme (the most complex) in this class. The networks at the other extreme (the simplest) have n/2 - 1 fewer comparators in each of their  log_2 n stages.

                    [Beck98] Becker, R.I., D. Nassimi, and Y. Perl, “The New Class of g-Chain Periodic Sorters,” J. Parallel and Distributed Computing, Vol. 54, pp. 206-222, 1998.

p. 209          [010105] Regarding Problem 10.14, a comprehensive survey of interval routing appears in [Gavo00].

                    [Gavo00] Gavoille, C., “A Survey of Interval Routing,” Theoretical Computer Science, Vol. 245, pp. 217-253, 28 Aug. 2000.

p. 213          [001212h] The following diagram complements Fig. 11.1 (it shows how Fig. 11.1 was derived).

      

p. 216          [010516] Replace “back substitution” with “forward substitution” throughout the discussion and in the algorithm title. After the definition of forward substitution, following “... and so forth” add: (the corresponding method for upper triangular systems is called back substitution, because evaluation goes backward from x_n–1 to x_0).

p. 220          [020308a] In line 2 after Figure 11.10, change "theta(sqrt(p))" to "theta(sqrt(m))"; that is, replace p with m.

p. 229          [001212j] In the caption to Figure 11.20, change "Lavialdi's algorithm" to "Levialdi's algorithm in the shrinkage phase" (note spelling correction).

p. 230          [001212k] In the caption to Figure 11.21, change "Lavialdi" to "Levialdi".

p. 231          [001212m] In lines 9 and 22, change "Lavialdi" to "Levialdi".

p. 232          [010516a] Change "Back substitution" in the title and first line of Problem 11.4 to "Forward substitution".

p. 233          [001212n] In the title to Problem 11.13 and in parts a, c, and d, change "Lavialdi" to "Levialdi".

p. 233          [001212p] In the reference labeled [Lavi72], change "Lavialdi" to "Levialdi". Also, change the reference label to [Levi72].

p. 233          [001212q] Add [Math87] to the reference list for Chapter 11. {Contains a good exposition of matrix algorithms, including LU factorization and its comparison with Gaussian elimination.}

                    [Math87] Mathews, J.H., Numerical Methods for Computer Science, Engineering, and Mathematics, Prentice-Hall, 1987.

p. 248          [010521] Line 8: Change the exponent of 2 from 2l - 2 to l - 2; i.e., the number of processors in an l-level mesh of trees is p = (2^l)(3 × 2^(l-2) - 1).
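
                    One way to verify the corrected count (a sketch, under the assumption that an l-level mesh of trees has a 2^(l-1)-by-2^(l-1) base of leaf processors, with a complete binary tree over each row and each column): writing m = 2^(l-1), each of the m row trees and m column trees contributes m - 1 internal nodes, so

                    $$p = m^2 + 2m(m-1) = 3m^2 - 2m = 2^l\left(3 \times 2^{\,l-2} - 1\right).$$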

p. 253          [050523] Add the following explanation to part b of Problem 12.1: The restriction of interprocessor communication to a single dimension at a time for all processors is known as the uniaxis communication model.

p. 257          [001212r] Line 8: Change “q-D” to “qD”.

p. 266          [020308b] In line 10 after Fig. 13.3, replace "a torus" with "a 2D torus".

p. 266          [020308c] In the last two lines before Fig. 13.4, remove the open parenthesis before 2^m_0 and the closed parenthesis after 2^m_h-1.

p. 276          [020308] At the end of Problem 13.6, change "from 0 to 2^q - 2" to "from 1 to 2^q - 1". {The original statement is not incorrect, but the modified statement makes the solution somewhat cleaner.}

p. 277          [010521a] Part a of Problem 13.13: Change “back substitution” to “forward substitution”.

p. 290          [001212s] Fig. 14.7 caption should read: Packing is a “good” problem for dimension-order routing on the hypercube.

p. 291          [001212t] Fig. 14.8 caption should read: Bit-reversal permutation is a “bad” problem for dimension-order routing on the hypercube.

p. 305          [001212u] Line 7: 0(2q) should be o(2q); i.e., change “0” to lower case roman “oh”.

p. 312          [001212v] The following diagram is a graphical representation of the communication pattern in ascend, descend, and normal algorithms.

      

p. 314          [010521b] On the next to the last line (above Fig. 15.17), at the beginning of the line, "011011" should be replaced by "01011011".

p. 319          [030220b] In Problem 15.11, part d, italicize the last occurrence of x on the last line.

p. 328          [050523a] On line 6, the diameter of the star graph should be given as ⌊3(q - 1)/2⌋, instead of 2q - 3.

p. 328          [050523b] On line 7, change "optimal" to "suboptimal" and delete the sentence beginning with "Note, however".
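
                    For a quick numeric comparison related to the two p. 328 entries above (simple evaluation of the two formulas, added for illustration):

                    $$\left\lfloor \tfrac{3(q-1)}{2} \right\rfloor \Big|_{q=5} = 6, \qquad (2q-3)\Big|_{q=5} = 7.$$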

p. 341          [050523c] On line 4 of the introduction to Problem 16.4, change "the other 2q bits" to "the other 2q - 2 bits".

p. 343          [001213] Add [Lisz97] to the reference list for Chapter 16. {Good overview paper, suggesting that it is difficult to compare networks fairly. We don’t know if analytically tractable models can predict real behavior. It is not at all clear how diverse networks used within distinct architectures can be compared with regard to cost-effectiveness.}

                    [Lisz97] Liszka, K.J., J.K. Antonio, and H.J. Siegel, “Problems with Comparing Interconnection Networks: Is an Alligator Better than an Armadillo?”, IEEE Concurrency, Vol. 5, No. 4, pp. 18-28, October–December 1997.

p. 350          [010601] Line 2 from bottom of page: Change "five-node graph" to "four-node graph".

p. 361          [010601a] Line 16: Change "upper bound for" to "approximation to".

p. 361          [010521c] Line 17: In the formula for speed-up, change "≤" to "≈".

p. 366          [040105] In part a of Problem 17.12, change "scheduling theorem" to "scheduling algorithm".

p. 376          [020311] In Fig. 18.4, change the label of the topmost horizontal arrow, going from the "Exclusive" state to the "Shared" state, to: Read miss: Fetch data value, return data value, sharing set = sharing set + {c}

p. 378          [001213a] Before the last paragraph of Section 18.3, add: Within certain limits, multithreading may have the same effect on performance as parallel processing. The results of [Sato99] show that for some numerical applications, performance depends on the product of the number of nodes and the number of threads. For example, in multiplying large matrices, the speed-ups achieved are almost identical with 2 nodes at 4 threads, 4 nodes at 2 threads, and 8 nodes at a single thread.

                    [Sato99] Sato, M., “COMPaS: A PC-Based SMP Cluster,” IEEE Concurrency, Vol. 7, No. 1, pp. 82-86, January-March 1999.

p. 398          [001213b] At the end of the next to last paragraph, add: However, if we are allowed to change the external connections to the array, so that either the top or bottom row (leftmost or rightmost column) can be viewed as a spare row (column), then all triple defects and a larger fraction of other defect patterns become tolerable.

p. 404          [001213c] At the end of the first full paragraph, add: For a review of algorithm-based fault tolerance methods, see [Vija97].

                    [Vija97] Vijay, M. and R. Mittal, “Algorithm-Based Fault Tolerance: A Review,” Microprocessors and Microsystems, Vol. 21, pp. 151-161, 1997.

p. 405          [050523d] On line 31, change "P4 being the only healthy unit" to "P3 being the only healthy unit".

p. 418          [001213d] After the description of Fig. 20.2 in Section 20.1, add: For a discussion of hot-spot contention in shared-memory systems, see the end of Section 6.6.

p. 427          [001213e] Section 20.3 – For updates on PVM see:  www.netlib.org/pvm3/book/node1.html

p. 435          [001213f] Part (c) of problem 20.9: Change “with” to “WITH”.

p. 457          [001213g] Add [Dahl99] to the reference list for Chapter 21. {Reviews simple, hierarchical, and flat COMA architectures and qualitatively compares them to NUMA variants. Concludes that a combined architecture allowing NUMA/COMA handling of data on a per-page basis may be the best solution.}

                    [Dahl99] Dahlgren, F. and J. Torrellas, “Cache-Only Memory Architectures”, IEEE Computer, Vol. 32, No. 6, pp. 72-79, June 1999.

p. 464          [001213h] An excellent introduction to Myricom’s Myrinet switch can be found in [Bode95].

                    [Bode95] Boden, N.J., et al., “Myrinet: A Gigabit-per-Second Local Area Network,” IEEE Micro, Vol. 15, No. 1, pp. 29-36, Feb. 1995.

p. 508          [020104a] For an update on the course of the ASCI program, including data on the 3 TFLOPS Blue Pacific (IBM, at Lawrence Livermore Lab), the 3 TFLOPS Blue Mountain (SGI/Cray, at Los Alamos), and the 12 TFLOPS ASCI White (IBM), see [Clar98]. ASCI White is also described and depicted in the Nov. 2001 issue of IEEE Spectrum, pp. 29-31.

                    [Clar98] Clark, D., “ASCI Pathforward: To 30 Tflops and Beyond,” IEEE Concurrency, Vol. 6, No. 2, pp. 13-15, Apr.-June 1998.

p. 508          [001213i] An overview of the Intel Pentium Pro project can be found in [Papw96].

                    [Papw96] Papworth, D.B., “Tuning the Pentium Pro Microarchitecture,” IEEE Micro, Vol. 16, No. 2, pp. 8-15, Apr. 1996.

                    For information on a competing processor architecture, the MIPS R10000, see [Yeag96].

                    [Yeag96] Yeager, K.C., “The MIPS R10000 Superscalar Microprocessor,” IEEE Micro, Vol. 16, No. 2, pp. 28-40, Apr. 1996.

      


      


Dr. Behrooz Parhami, Professor
Dept. Electrical & Computer Eng.
Univ. of California, Santa Barbara
Santa Barbara, CA 93106-9560 USA

E-mail: parhami@ece.ucsb.edu
Office phone: +1 805 893 3211
Messages: +1 805 893 3716
Dept. fax: +1 805 893 3262
Office: Room 5155 Eng. I
Deliveries: Room 4155 Eng. I