Textbook on Parallel Processing
Behrooz Parhami: 2007/06/19 || E-mail: parhami@ece.ucsb.edu || Problems: webadmin@ece.ucsb.edu
On June 19, 2007, Professor Parhami's UCSB ECE website moved to a new location. For an up-to-date version of this page, visit the new address: http://www.ece.ucsb.edu/~parhami/text_par_proc.htm
B. Parhami, Introduction to Parallel Processing: Algorithms and Architectures, Plenum Press, New York, 1999, 532 + xxi pp., ISBN 0-306-45970-1. Available for purchase from Kluwer Academic Publishers and various college or on-line bookstores.

Preface

The context of parallel processing
The
field of digital computer architecture has grown explosively in the past two
decades. Through a steady stream of experimental research, tool-building
efforts, and theoretical studies, the design of
an instruction-set architecture, once considered an art, has been
transformed into one of the most
quantitative branches of computer technology. At the same time, better
understanding of various forms of concurrency, from standard pipelining to
massive parallelism, and invention of architectural structures to support a
reasonably efficient and user-friendly programming model for such systems, have
allowed hardware performance to continue its exponential growth. This trend is
expected to continue in the near future. This
explosive growth, linked with the expectation that performance will continue its
exponential rise with each new generation of hardware and that (in stark
contrast to software) computer hardware will function correctly as soon as it
comes off the assembly line, has its downside. It has led to unprecedented
hardware complexity and almost intolerable development costs. The challenge
facing current and future computer designers is to institute simplicity where we
now have complexity; to use fundamental theories being developed in this area to
gain performance and ease-of-use benefits from simpler circuits; to understand
the interplay between technological capabilities and limitations, on the one
hand, and design decisions based on user and application requirements on the
other. In
the computer designers’ quest for user-friendliness, compactness, simplicity,
high performance, low cost, and low power, parallel processing plays a key role.
High-performance uniprocessors are becoming increasingly complex, expensive, and
power-hungry. A basic tradeoff thus exists between the use of
one or a small number of such complex processors, at one extreme, and a
moderate to very large number of simpler processors, at the other. When combined
with a high-bandwidth, but logically simple, inter-processor communication
facility, the latter approach leads to significant simplification of the design
process. However, two major roadblocks have thus far prevented the widespread
adoption of such moderately to massively parallel architectures: The
inter-processor communication bottleneck and the difficulty, and thus high cost,
of algorithm/software development. The
above context is changing due to several factors. First, at very high clock
rates, the link between the processor and memory becomes very critical. CPUs can
no longer be designed and verified in isolation. Rather, an integrated
processor/memory design optimization is required which makes the development
even more complex and costly. VLSI technology now allows us to put more
transistors on a chip than required by even the most advanced superscalar
processor. The bulk of these transistors are now being used to provide
additional on-chip memory. However, they can just as easily be used to build
multiple processors on a single chip. Emergence of multiple-processor
microchips, along with currently available methods for glueless combination of
several chips into a larger system, and maturing standards for parallel machine
models, holds the promise of making parallel processing more practical. This
is why parallel processing occupies such a prominent place in computer
architecture education and research. New parallel architectures appear with
amazing regularity in technical publications, while older architectures are
studied and analyzed in novel and insightful ways. The wealth of published
theoretical and practical results on parallel architectures and algorithms is
truly awe-inspiring. The emergence of standard programming and communication
models has removed some of the concerns with compatibility and software design
issues in parallel processing, thus resulting in new designs and products with
mass-market appeal. Given the computation-intensive nature of many application
areas (such as encryption, physical modeling, and multimedia), parallel
processing will continue to thrive for years to come. Perhaps,
as parallel processing matures further, it will start to become invisible.
Packing many processors in a computer might constitute as much a part of a
future computer architect’s toolbox as pipelining, cache memories, and
multiple instruction issue do today. In this scenario, even though the
multiplicity of processors will not affect the end user or even the professional
programmer (other than of course boosting the system performance), the number
might be mentioned in sales literature to lure customers, in the same way that
clock frequency and cache size are now used. The challenge will then shift from
making parallel processing work to incorporating a larger number of processors,
more economically and in a truly seamless fashion.

The goals and structure of this book

The
field of parallel processing has matured to the point that scores of texts and
reference books have been published. Some of these books that cover parallel
processing in general (as opposed to some special aspects of the field or
advanced/unconventional parallel systems) are listed at the end of this preface.
Each of these books has its unique strengths and has contributed to the
formation and fruition of the field. The current text, Introduction to Parallel Processing: Algorithms and Architectures,
is an outgrowth of lecture notes that the author has developed and refined over
many years, beginning in the mid-1980s. Here are the most important features of
this text in comparison to the listed books:

a. Division of material into lecture-size chapters: In
my approach to teaching, a lecture is a more or less self-contained module with
links to past lectures and pointers to what will transpire in the future. Each
lecture must have a theme or title and must proceed from motivation, to details,
to conclusion. There must be smooth transitions between lectures and a clear
enunciation of how each lecture fits into the overall plan. In designing the
text, I have strived to divide the material into chapters, each of which is
suitable for one lecture (1-2 hours). A short lecture can cover the first few
subsections while a longer lecture might deal with more advanced material near
the end. To make the structure hierarchical, as opposed to flat or linear,
chapters have been grouped into six parts, each composed of four closely related
chapters (see the diagram below).

b. A large number of meaningful problems: At least 13 problems have
been provided at the end of each of the 24 chapters. These are well thought out
problems, many of them class-tested, that complement the material in the
chapter, introduce new viewing angles, and link the chapter material to topics
in other chapters.

c. Emphasis on both the underlying theory and practical designs: The
ability to cope with complexity requires both a deep knowledge of the
theoretical underpinnings of parallel processing and examples of designs that
help us understand the theory. Such designs also provide hints/ideas for
synthesis as well as reference points for cost-performance comparisons. This
viewpoint is reflected, e.g., in the coverage of problem-driven parallel machine
designs (Chapter 8) that point to the origins of the butterfly and binary-tree
architectures. Other examples are found in Chapter 16 where a variety of
composite and hierarchical architectures are discussed and some fundamental
cost/performance tradeoffs in network design are exposed. Fifteen carefully
chosen case studies in Chapters 21-23 provide additional insight and motivation
for the theories discussed.

d. Linking parallel computing to other subfields of computer design: Parallel
computing is nourished by, and in turn feeds, other subfields of computer
architecture and technology. Examples of such links abound. In computer
arithmetic, the design of high-speed adders and multipliers contributes to, and
borrows many methods from, parallel processing. Some of the earliest parallel
systems were designed by researchers in the field of fault-tolerant computing in
order to allow independent multi-channel computations and/or dynamic replacement
of failed subsystems. These links are pointed out throughout the book.

e. Wide coverage of important topics: The current text covers virtually
all important architectural and algorithmic topics in parallel processing, thus
offering a balanced and complete view of the field. Coverage of the circuit
model and problem-driven parallel machines (Chapters 7-8), some variants of mesh
architectures (Chapter 12), composite and hierarchical systems (Chapter 16),
which are becoming increasingly important for overcoming VLSI layout and
packaging constraints, and the topics in Part V (Chapters 17-20) do not all
appear in other textbooks. Similarly, other books that cover the foundations of
parallel processing do not contain discussions on practical implementation
issues and case studies of the type found in Part VI.

f. Unified and consistent notation/terminology throughout the text: I
have tried very hard to use consistent notation/terminology throughout the text.
For example, n always stands for the
number of data elements (problem size) and p for the number of processors. While other authors have done this
in the basic parts of their texts, there is a tendency to cover more advanced
research topics by simply borrowing the notation and terminology from the
reference source. Such an approach has the advantage of making the transition
between reading the text and the original reference source easier, but it is
utterly confusing to the majority of the students who rely on the text and do
not consult the original references except, perhaps, to write a research paper.

Summary of topics

The
6 parts of this book, each composed of 4 chapters, have been written with the
following goals:

Part I sets the stage, gives a taste of what is to come, and provides the needed perspective, taxonomy, and analysis tools for the rest of the book.

Part II delimits the models of parallel processing from above (the abstract PRAM model) and from below (the concrete circuit model), preparing the reader for everything else that falls in the middle.

Part III presents the scalable, and conceptually simple, mesh model of parallel processing, which has become quite important in recent years, and also covers some of its derivatives.

Part IV covers low-diameter parallel architectures and their algorithms, including the hypercube, hypercube derivatives, and a host of other interesting interconnection topologies.

Part V covers architecture-independent topics that are relevant to a broad range of systems and form the stepping stones to effective and reliable parallel processing.

Part VI deals with implementation aspects and properties of various classes of parallel processors, presenting many case studies and projecting a view of the past and future of the field.

Pointers on how to use the book

For
classroom use, the topics in each chapter of this text can be covered in a
lecture of duration 1-2 hours. In my own teaching, I have used the chapters
primarily for 1.5-hour lectures, twice a week, in a 10-week quarter, omitting or
combining some chapters to fit the material into 18-20 lectures. But the modular
structure of the text lends itself to other lecture formats, self-study, or
review of the field by practitioners. In the latter two cases, the readers can
view each chapter as a study unit (for one week, say) rather than as a lecture.
Ideally, all topics in each chapter should be covered before moving to the next
chapter. However, if fewer lecture hours are available, then some of the
subsections located at the end of chapters can be omitted or introduced only in
terms of motivations and key results. Problems
of varying complexities, from straightforward numerical examples or exercises to
more demanding studies or mini-projects, have been supplied for each chapter.
These problems form an integral part of the book and have not been added as
afterthoughts to make the book more attractive for use as a text. A total of 358
problems are included (13-16 per chapter). Assuming that
two lectures are given per week, either weekly or bi-weekly homework can
be assigned, with each assignment having the specific coverage of the respective
half-part (2 chapters) or full-part (4 chapters) as its “title”. In this
format, the half-parts, shown in the diagram below, provide a focus for the
weekly lecture and/or homework schedule. An
instructor’s manual, with problem solutions and enlarged versions of the
diagrams and tables, suitable for reproduction as transparencies, is planned.
The author’s detailed syllabus for the course ECE 254B at UCSB is available
from the author’s Web page (look under “Teaching
and Textbooks”). References
to important or state-of-the-art research contributions and designs are provided
at the end of each chapter. These references provide good starting points for
doing in-depth studies or for preparing term papers/projects.
New
ideas in the field of parallel processing appear in papers presented at several
annual conferences, known as FMPC, ICPP, IPPS, SPAA, SPDP (now merged with IPPS),
and in archival journals such as IEEE Transactions on Computers [TCom], IEEE Transactions on Parallel and Distributed Systems [TPDS], Journal
of Parallel and Distributed Computing [JPDC], Parallel
Computing [ParC], and Parallel
Processing Letters [PPL]. Tutorial and survey papers of wide scope appear in
IEEE Concurrency [Conc] and, occasionally, in IEEE Computer [Comp]. The articles in IEEE Computer provide excellent starting points for research
projects and term papers.

Acknowledgments

The
current text, Introduction to Parallel
Processing: Algorithms and Architectures, is an outgrowth of lecture notes
that the author has used for the graduate course “ECE 254B: Advanced Computer
Architecture -- Parallel Processing” at the University of California, Santa
Barbara, and, in rudimentary forms, at several other institutions prior to 1988.
The text has benefited greatly from the keen observations, curiosity, and
encouragement of my many students in these courses. A sincere thanks to all of
them! Particular thanks go to Dr. Ding-Ming Kwai who read an early version of
the manuscript carefully and suggested numerous corrections and improvements.

General references

[Akl89] Akl, S.G., The Design and Analysis of Parallel Algorithms, Prentice-Hall, 1989.
[Akl97] Akl, S.G., Parallel Computation: Models and Methods, Prentice-Hall, 1997.
[Alma94] Almasi, G.S. and A. Gottlieb, Highly Parallel Computing, Benjamin/Cummings, 2nd Ed., 1994.
[Bert89] Bertsekas, D.P. and J.N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, Prentice-Hall, 1989.
[Code93] Codenotti, B. and M. Leoncini, Introduction to Parallel Processing, Addison-Wesley, 1993.
[Comp] IEEE Computer, journal published by the IEEE Computer Society; has occasional special issues on parallel/distributed processing (Feb. 1982, June 1985, Aug. 1986, June 1987, Mar. 1988, Aug. 1991, Feb. 1992, Nov. 1994, Nov. 1995, Dec. 1996).
[Conc] IEEE Concurrency, formerly IEEE Parallel & Distributed Technology, magazine published by the IEEE Computer Society.
[Cric88] Crichlow, J.M., Introduction to Distributed and Parallel Computing, Prentice-Hall, 1988.
[DeCe89] DeCegama, A.L., Parallel Processing Architectures and VLSI Hardware, Prentice-Hall, 1989.
[Desr87] Desrochers, G.R., Principles of Parallel and Multiprocessing, McGraw-Hill, 1987.
[Duat97] Duato, J., S. Yalamanchili, and L. Ni, Interconnection Networks: An Engineering Approach, IEEE Computer Society Press, 1997.
[Flyn95] Flynn, M.J., Computer Architecture: Pipelined and Parallel Processor Design, Jones and Bartlett, 1995.
[FMPC] Proc. Symp. Frontiers of Massively Parallel Computation, sponsored by the IEEE Computer Society and NASA. Held every 1.5-2 years since 1986. The 6th FMPC was held in Annapolis, MD, Oct. 27-31, 1996, and the 7th is planned for ...
[Foun94] Fountain, T.J., Parallel Computing: Principles and Practice, Cambridge Univ. Press, 1994.
[Hock81] Hockney, R.W. and C.R. Jesshope, Parallel Computers, Adam Hilger, 1981.
[Hord90] Hord, R.M., Parallel Supercomputing in SIMD Architectures, CRC Press, 1990.
[Hord93] Hord, R.M., Parallel Supercomputing in MIMD Architectures, CRC Press, 1993.
[Hwan84] Hwang, K. and F.A. Briggs, Computer Architecture and Parallel Processing, McGraw-Hill, 1984.
[Hwan93] Hwang, K., Advanced Computer Architecture: Parallelism, Scalability, Programmability, McGraw-Hill, 1993.
[Hwan98] Hwang, K. and Z. Xu, Scalable Parallel Computing: Technology, Architecture, Programming, McGraw-Hill, 1998.
[ICPP] Proc. Int'l Conference on Parallel Processing, sponsored by The Ohio State Univ. (and in recent years, also by The Int'l Association for Computers and Communications). Held annually since 1972. The 27th ICPP was held in Minneapolis, MN, Aug. 10-15, 1998, and the 28th is scheduled for Sep. 21-24, 1999, in Aizu, Japan.
[IPPS] Proc. Int'l Parallel Processing Symp., sponsored by the IEEE Computer Society. Held annually since 1987. The 11th IPPS was held in Geneva, Switzerland, Apr. 1-5, 1997. Beginning with the 1998 symposium in Orlando, FL, Mar. 30 - Apr. 3, IPPS was merged with SPDP. The next joint IPPS/SPDP is scheduled for Apr. 12-16, 1999, in San Juan, Puerto Rico.
[JaJa92] JaJa, J., An Introduction to Parallel Algorithms, Addison-Wesley, 1992.
[JPDC] Journal of Parallel & Distributed Computing, published by Academic Press.
[Kris89] Krishnamurthy, E.V., Parallel Processing: Principles and Practice, Addison-Wesley, 1989.
[Kuma94] Kumar, V., A. Grama, A. Gupta, and G. Karypis, Introduction to Parallel Computing: Design and Analysis of Algorithms, Benjamin/Cummings, 1994.
[Laks90] Lakshmivarahan, S. and S.K. Dhall, Analysis and Design of Parallel Algorithms: Arithmetic and Matrix Problems, McGraw-Hill, 1990.
[Leig92] Leighton, F.T., Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes, Morgan Kaufmann, 1992.
[Lerm94] Lerman, G. and L. Rudolph, Parallel Evolution of Parallel Processors, Plenum, 1994.
[Lipo87] Lipovski, G.J. and M. Malek, Parallel Computing: Theory and Comparisons, Wiley, 1987.
[Mold93] Moldovan, D.I., Parallel Processing: From Applications to Systems, Morgan Kaufmann, 1993.
[ParC] Parallel Computing, journal published by North-Holland.
[PPL] Parallel Processing Letters, journal published by World Scientific.
[Quin87] Quinn, M.J., Designing Efficient Algorithms for Parallel Computers, McGraw-Hill, 1987.
[Quin94] Quinn, M.J., Parallel Computing: Theory and Practice, McGraw-Hill, 1994.
[Reif93] Reif, J.H. (Ed.), Synthesis of Parallel Algorithms, Morgan Kaufmann, 1993.
[Sanz89] Sanz, J.L.C. (Ed.), Opportunities and Constraints of Parallel Computing (IBM/NSF Workshop, San Jose, CA, Dec. 1988), Springer-Verlag, 1989.
[Shar87] Sharp, J.A., An Introduction to Distributed and Parallel Processing, Blackwell Scientific Publications, 1987.
[Sieg85] Siegel, H.J., Interconnection Networks for Large-Scale Parallel Processing, Lexington Books, 1985.
[SPAA] Proc. Symp. Parallel Algorithms and Architectures, sponsored by the Association for Computing Machinery (ACM). Held annually since 1989. The 9th SPAA was held in Newport, RI, June 22-25, 1997, and the 10th is planned ...
[SPDP] Proc. Int'l Symp. Parallel and Distributed Processing, sponsored by the IEEE Computer Society. Held annually since 1989, except for 1997. The 8th SPDP was held in New Orleans, LA, Oct. 23-26, 1996. Beginning with the 1998 symposium in Orlando, FL, Mar. 30 - Apr. 3, SPDP was merged with IPPS.
[Ston93] Stone, H.S., High-Performance Computer Architecture, Addison-Wesley, 1993.
[TCom] IEEE Trans. Computers, journal published by the IEEE Computer Society; has occasional special issues on parallel and distributed processing (Apr. 1987, Dec. 1988, Aug. 1989, Dec. 1991, Apr. 1997, Apr. 1998).
[TPDS] IEEE Trans. Parallel & Distributed Systems, journal published by the IEEE Computer Society.
[Varm94] Varma, A. and C.S. Raghavendra, Interconnection Networks for Multiprocessors and Multicomputers: Theory and Practice, IEEE Computer Society Press, 1994.
[Zoma96] Zomaya, A.Y. (Ed.), Parallel and Distributed Computing Handbook, McGraw-Hill, 1996.

Contents at a Glance

The structure of this book in parts, half-parts, and chapters.
Complete Table of Contents

Note: Each chapter ends with problems and references.

Part I Fundamental Concepts
1 Introduction to Parallelism
2 A Taste of Parallel Algorithms
3 Parallel Algorithm Complexity
4 Models of Parallel Processing
Part II Extreme Models
5 PRAM and Basic Algorithms
6 More Shared-Memory Algorithms
7 Sorting and Selection Networks
8 Other Circuit-Level Examples
Part III Mesh-Based Architectures
9 Sorting on a 2-D Mesh or Torus
10 Routing on a 2-D Mesh or Torus
11 Numerical 2-D Mesh Algorithms
12 Other Mesh-Related Architectures
Part IV Low-Diameter Architectures
13 Hypercubes and Their Algorithms
14 Sorting and Routing on Hypercubes
15 Other Hypercubic Architectures
16 A Sampler of Other Networks
Part V Some Broad Topics
17 Emulation and Scheduling
18 Data Storage, Input, and Output
19 Reliable Parallel Processing
20 System and Software Issues
Part VI Implementation Aspects
21 Shared-Memory MIMD Machines
22 Message-Passing MIMD Machines
23 Data-Parallel SIMD Machines
24 Past, Present, and Future
Index

From the Preface to the Instructor's Manuals

This instructor's manual consists of two volumes. Volume 1 presents solutions to selected problems and includes additional problems (many with solutions) that did not make the cut for inclusion in the text Introduction to Parallel Processing: Algorithms and Architectures (Plenum Press, 1999) or that were designed after the book went to print. Volume 2 contains enlarged versions of the figures and tables in the text, as well as additional material, presented in a format that is suitable for use as transparency masters.

The winter 2002 edition of Volume 1, which consists of the following parts, is available to qualified instructors through the publisher:

Volume 1:
Part I Selected solutions and additional problems
Part II Question bank, assignments, and projects

The winter 2002 edition of Volume 2, which consists of the following parts, is available as a large downloadable file through the book's Web page (see Presentations below):

Volume 2:
Parts I-VI Lecture slides and other presentation material

The book's Web page, given below, also contains an errata list and a host of other material (please note the upper-case "F" and "P" and the underscore symbol after "text" and "par"): http://www.ece.ucsb.edu/Faculty/Parhami/text_par_proc.htm

The author would appreciate the reporting of any error in the textbook or in this manual, suggestions for additional problems, alternate solutions to solved problems, solutions to other problems, and sharing of teaching experiences. Please e-mail your comments to the author or send them by regular mail to the author's postal address:
Department of Electrical and Computer Engineering
University of California
Santa Barbara, CA 93106-9560, USA

Contributions will be acknowledged to the extent possible.

Presentations

The following versions of the Instructor's Manual, Vol. 2, are available for downloading:
The following PowerPoint and PDF presentation files (prepared in 2005 and last updated in spring 2006, 1-2 MB each) supersede the manuals just listed:
Errors and Additions

The following additions and corrections are presented in the order of page numbers in the book. Each entry is assigned a unique code, which appears in square brackets and begins with the date of posting or last modification in the format "yymmdd". As errors are removed in future printings, the codes in square brackets will be augmented to include the number of the first printing that no longer contains the error. Note: To find out which printing of the book you own, look at the bottom of the copyright page; there, you will see a sequence of numbers "10 9 8 . . .", the last of which is the printing.
p. 10 [010410c] In Fig. 1.5, lower right corner, the label "(b)" is redundant and should be removed.

p. 13 [010410d] In the right panel of Fig. 1.9, replace "Real" with "Actual".

p. 21 [030220] In Problem 1.1, part a, replace "MIPS rating" by "IPS rating".

p. 21 [001212] In Problem 1.5, replace "For each of the values, 1, 10, and 100 for the parameter b, determine the range of values for c where . . ." by "Determine the range of values for b and c where . . .".

p. 23 [001212a] Add [Bokh87] to the reference list for Chapter 1. [Bokh87] Bokhari, S., "Multiprocessing the Sieve of Eratosthenes," IEEE Computer, Vol. 20, No. 4, pp. 50-58, April 1987.

p. 50
[001212b] Table 3.3, leftmost column, last entry: align "1" with the numbers above it.

p. 50 [001212c] Table 3.3, column headed 100 sqrt(n): the first entry should be 5 min (not .5 min).

p. 50 [010410a] Table 3.3, column headed 100 sqrt(n): the next to the last entry should be 9 hr (not 1 yr).

p. 50 [010410b] Table 3.3, column headed 100 sqrt(n): the last entry should be 1 day (not 3 yr).

p. 55
[001212d] Evidence for the suspicion that NC ≠ P falls into three categories [Gree95, pp. 61-70]:

a. The best general simulations of sequential models yield modest speed-ups of log T or sqrt(T), where T is the sequential time. Furthermore, these simulations need 2^(T^Ω(1)), or exponentially many, processors. A fast general simulation would allow us to run sequential programs in parallel with good speed-up. Such a "super" parallelizing compiler is impractical.

b. All known fast simulations (parallelizations) have very limited domains of applicability. For example, we know that certain machines with limited resources (e.g., time and space) can recognize languages that are in NC but that the resources must increase exponentially to cover all languages in P. Hence, NC = P would mean that this exponential increase in resources does not add to the machine's computing power. This is quite unlikely.

c. Certain natural approaches to parallelization have been proven to be insufficient.

For a list of P-complete problems, see [Gree95, pp. 119-220]. Open problems, which are not known to be P-complete or in NC, are listed in [Gree95, pp. 221-243]. [Gree95] Greenlaw, R., H.J. Hoover, and W.L. Ruzzo, Limits to Parallel Computation: P-Completeness Theory, Oxford Univ. Press, 1995.

p. 59 [010330] Near the bottom of the page, the solution given for g(n) = g(n - 1) + 1/2 is incorrect; the correct solution, (n - 1)/2, makes the answer different from that obtained from unrolling. The problem stems from the fact that the assumptions f(1) = 0 and g(1) = 0 are incompatible, given the definition f(n) = an^2 + g(n). The simplest fix is to replace "assuming g(1) = 0" on the third line from the bottom of the page with "assuming g(2) = 0". An alternate fix is to write f(n) = an^2 + g(n) - a on line 9 from the bottom, arguing that this form makes f(1) = g(1) = 0 possible.

p. 79
[020104] Before the last paragraph, add: Both LogP and BSP model the network bandwidth limitation on a per-processor basis. It is also possible to impose a global bandwidth limit (of, say, p/g). This can be shown to lead to a fundamentally stronger model [Adle99] that should not be used if the hardware in fact imposes a per-processor limit. [Adle99] Adler, M., P.B. Gibbons, Y. Matias, and V. Ramachandran, "Modeling Parallel Bandwidth: Local versus Global Restrictions," Algorithmica, Vol. 24, pp. 381-404, 1999.

p. 97 [030220a] On lines 2 and 3 from the bottom, replace "having more processors than items" by "having fewer processors than items".

p. 103
[001212f] Line 3: Change "m^2 processors" to "m^2 processors" (italicize m).

p. 124 [010410] Lines 11-12 are to be slightly modified as follows: This is a butterfly network that we will encounter again in Chapter 8, where we devise a circuit for computing the fast Fourier transform (FFT), and in Chapter 15, (insert two commas and remove "again")

p. 125 [001212g] At the end of Section 6.6, add: For a good discussion of hot-spot contention, see [Dand99]. [Dand99] Dandamudi, S.P., "Reducing Hot-Spot Contention in Shared-Memory Multiprocessor Systems," IEEE Concurrency, Vol. 7, No. 1, pp. 48-59, January-March 1999.

p. 127 [020123] Replace the one-line introduction of Problem 6.14 with the following: Consider the butterfly memory access network depicted in Fig. 6.9, but with only eight processors and eight memory banks, one connected to each circular node in columns 0 and 3 of the butterfly.

p. 135 [010501] On line 7 in the first paragraph of Section 7.3, n floor(n/2) should be n(n - 1)/2.

p. 138 [010524] Lines 6-7: The 0s and 1s on line 6, and the w_i terms right below them, should be vertically aligned to appear halfway between the 0s and 1s on line 5.

p. 138 [010524a] Last line: The rightmost symbol in the equation for the cost C(n) of Batcher's even-odd-merge sorter should be 4, not 2.

p. 141 [010919] Becker, Nassimi, and Perl [Beck98] provide a generalization of the periodic sorting networks of Dowd et al. [Dowd89] that leads to a large class of similarly structured networks with identical latencies. The networks of Dowd et al. represent one extreme (the most complex) in this class. The networks at the other extreme (the simplest) have n/2 - 1 fewer comparators in each of their log_2 n stages. [Beck98] Becker, R.I., D. Nassimi, and Y. Perl, "The New Class of g-Chain Periodic Sorters," J. Parallel and Distributed Computing, Vol. 54, pp. 206-222, 1998.

p. 209
[010105] Regarding Problem 10.14, a comprehensive survey of interval routing appears in [Gavo00]. [Gavo00] Gavoille, C., "A Survey of Interval Routing," Theoretical Computer Science, Vol. 245, pp. 217-253, 28 Aug. 2000.

p. 213 [001212h] The following diagram complements Fig. 11.1 (it shows how Fig. 11.1 was derived).

p. 216
[010516] Replace "back substitution" with "forward substitution" throughout the discussion and in the algorithm title. After the definition of forward substitution, following "... and so forth" add: (the corresponding method for upper triangular systems is called back substitution, because evaluation goes backward from x_(n-1) to x_0).

p. 220 [020308a] In line 2 after Figure 11.10, change "theta(sqrt(p))" to "theta(sqrt(m))"; that is, replace p with m.

p. 229 [001212j] In the caption to Figure 11.20, change "Lavialdi's algorithm" to "Levialdi's algorithm in the shrinkage phase" (note spelling correction).

p. 230 [001212k] In the caption to Figure 11.21, change "Lavialdi" to "Levialdi".

p. 231 [001212m] In lines 9 and 22, change "Lavialdi" to "Levialdi".

p. 232 [010516a] Change "Back substitution" in the title and first line of Problem 11.4 to "Forward substitution".

p. 233 [001212n] In the title to Problem 11.13 and in parts a, c, and d, change "Lavialdi" to "Levialdi".

p. 233 [001212p] In the reference labeled [Lavi72], change "Lavialdi" to "Levialdi". Also, change the reference label to [Levi72].

p. 233 [001212q] Add [Math87] to the reference list for Chapter 11. {Contains a good exposition of matrix algorithms, including LU factorization and its comparison with Gaussian elimination.} [Math87] Mathews, J.H., Numerical Methods for Computer Science, Engineering, and Mathematics, Prentice-Hall, 1987.

p. 248 [010521] Line 8: Change the exponent of 2 from 2l - 2 to l - 2; i.e., the number of processors in an l-level mesh of trees is p = (2^l)(3 × 2^(l-2) - 1).

p. 253 [050523] Add the following explanation to part b of Problem 12.1: The restriction of interprocessor communication to a single dimension at a time for all processors is known as the uniaxis communication model.

p. 257
[001212r] Line 8: Change “q-D” to “qD”.
p. 266 [020308b] In line 10 after Fig. 13.3, replace "a torus" with "a 2D torus".
p. 266 [020308c] In the last two lines before Fig. 13.4, remove the open parenthesis before 2^m_0 and the closed parenthesis after 2^m_h-1.
p. 276 [020308] At the end of Problem 13.6, change "from 0 to 2^q - 2" to "from 1 to 2^q - 1". {The original statement is not incorrect, but the modified statement makes the solution somewhat cleaner.}
p. 277
[010521a] Part a of Problem 13.13: Change “back substitution” to “forward substitution”.
p. 290 [001212s] Fig. 14.7 caption should read: Packing is a “good” problem for dimension-order routing on the hypercube.
p. 291
[001212t] Fig. 14.8 caption should read: Bit-reversal permutation is a “bad” problem for dimension-order routing on the hypercube.
p. 305 [001212u] Line 7: 0(2q) should be o(2q); i.e., change the “0” (zero) to a lowercase roman “oh”.
p. 312 [001212v] The following diagram is a graphical representation of the communication pattern in ascend, descend, and normal algorithms.
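The ascend pattern referred to above can be illustrated with a short simulation. The following Python sketch (not from the book; the function name and the choice of summation as the per-step operation are illustrative assumptions) models an ascend-class algorithm on a 2^q-node hypercube: in step i, every node communicates with its neighbor across dimension i, proceeding from dimension 0 up to q - 1. A descend algorithm would simply scan the dimensions in the reverse order.

```python
# Hedged sketch: simulating the "ascend" communication pattern on a
# q-dimensional hypercube. In step i (i = 0, 1, ..., q-1), each node
# exchanges its value with the neighbor whose label differs in bit i.
# As a concrete instance, every node accumulates the sum of all 2^q
# initial values (an ascend-class all-reduce).

def hypercube_ascend_allreduce(values):
    """values[j] is the datum held by node j; len(values) must be 2**q."""
    p = len(values)
    q = p.bit_length() - 1
    assert p == 2 ** q, "node count must be a power of 2"
    v = list(values)
    for i in range(q):                   # ascend: dimensions 0, 1, ..., q-1
        nxt = v[:]
        for node in range(p):
            partner = node ^ (1 << i)    # neighbor across dimension i
            nxt[node] = v[node] + v[partner]
        v = nxt
    return v

# Example: 8 nodes (q = 3); after 3 steps every node holds the global sum.
print(hypercube_ascend_allreduce([1, 2, 3, 4, 5, 6, 7, 8]))
```

Because each step uses a single dimension and the dimensions are visited in increasing order, this also satisfies the definition of a normal algorithm.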
p. 314 [010521b] On the next-to-last line (above Fig. 15.17), at the beginning of the line, "011011" should be replaced by "01011011".
p. 319 [030220b] In Problem 15.11, part d, italicize the last occurrence of x on the last line.
p. 328 [050523a] On line 6, the diameter of the star graph should be given as ⌊3(q - 1)/2⌋, instead of 2q - 3.
p. 328 [050523b] On line 7, change "optimal" to "suboptimal" and delete the sentence beginning with "Note, however".
p. 341 [050523c] On line 4 of the introduction to Problem 16.4, change "the other 2q bits" to "the other 2q - 2 bits".
p. 343
[001213] Add [Lisz97] to the reference list for Chapter 16. {Good overview paper, suggesting that it is difficult to compare networks fairly. We don’t know whether analytically tractable models can predict real behavior, and it is not at all clear how diverse networks used within distinct architectures can be compared with regard to cost-effectiveness.}
[Lisz97] Liszka, K.J., J.K. Antonio, and H.J. Siegel, “Problems with Comparing Interconnection Networks: Is an Alligator Better than an Armadillo?”, IEEE Concurrency, Vol. 5, No. 4, pp. 18-28, October-December 1997.
p. 350 [010601] Line 2 from bottom of page: Change "five-node graph" to "four-node graph".
p. 361 [010601a] Line 16: Change "upper bound for" to "approximation to".
p. 361 [010521c] Line 17: In the formula for speed-up, change "≤" to "≈".
p. 366 [040105] In part a of Problem 17.12, change "scheduling theorem" to "scheduling algorithm".
p. 376 [020311] In Fig. 18.4, change the label of the topmost horizontal arrow, going from the "Exclusive" state to the "Shared" state, to: Read miss: Fetch data value, return data value, sharing set = sharing set + {c}
p. 378 [001213a] Before the last paragraph of Section 18.3, add: Within certain limits, multithreading may have the same effect on performance as parallel processing. The results of [Sato99] show that for some numerical applications, performance depends on the product of the number of nodes and the number of threads. For example, in multiplying large matrices, the speed-ups achieved are almost identical with 2 nodes at 4 threads, 4 nodes at 2 threads, and 8 nodes at a single thread.
[Sato99] Sato, M., “COMPaS: A PC-Based SMP Cluster,” IEEE Concurrency, Vol. 7, No. 1, pp. 82-86, January-March 1999.
p. 398 [001213b] At the end of the next-to-last paragraph, add: However, if we are allowed to change the external connections to the array, so that either the top or bottom row (leftmost or rightmost column) can be viewed as a spare row (column), then all triple defects and a larger fraction of other defect patterns become tolerable.
p. 404 [001213c] At the end of the first full paragraph, add: For a review of algorithm-based fault tolerance methods, see [Vija97].
[Vija97] Vijay, M. and R. Mittal, “Algorithm-Based Fault Tolerance: A Review,” Microprocessors and Microsystems, Vol. 21, pp. 151-161, 1997.
p. 405 [050523d] On line 31, change "P4 being the only healthy unit" to "P3 being the only healthy unit".
p. 418 [001213d] After the description of Fig. 20.2 in Section 20.1, add: For a discussion of hot-spot contention in shared-memory systems, see the end of Section 6.6.
p. 427
[001213e] Section 20.3: For updates on PVM, see www.netlib.org/pvm3/book/node1.html
p. 435 [001213f] Part (c) of Problem 20.9: Change “with” to “WITH”.
p. 457 [001213g] Add [Dahl99] to the reference list for Chapter 21. {Reviews simple, hierarchical, and flat COMA architectures and qualitatively compares them to NUMA variants. Concludes that a combined architecture allowing NUMA/COMA handling of data on a per-page basis may be the best solution.}
[Dahl99] Dahlgren, F. and J. Torrellas, “Cache-Only Memory Architectures,” IEEE Computer, Vol. 32, No. 6, pp. 72-79, June 1999.
p. 464
[001213h] An excellent introduction to Myricom’s Myrinet switch can be found in [Bode95].
[Bode95] Boden, N.J., et al., “Myrinet: A Gigabit-per-Second Local Area Network,” IEEE Micro, Vol. 15, No. 1, pp. 29-36, Feb. 1995.
p. 508 [020104a] For an update on the course of the ASCI program, including data on the 3 TFLOPS Blue Pacific (IBM, at Lawrence Livermore Lab), the 3 TFLOPS Blue Mountain (SGI/Cray, at Los Alamos), and the 12 TFLOPS ASCI White (IBM), see [Clar98]. ASCI White is also described and depicted in the Nov. 2001 issue of IEEE Spectrum, pp. 29-31.
[Clar98] Clark, D., “ASCI Pathforward: To 30 Tflops and Beyond,” IEEE Concurrency, Vol. 6, No. 2, pp. 13-15, Apr.-June 1998.
p. 508
[001213i] An overview of the Intel Pentium Pro project can be found in [Papw96].
[Papw96] Papworth, D.B., “Tuning the Pentium Pro Microarchitecture,” IEEE Micro, Vol. 16, No. 2, pp. 8-15, Apr. 1996.
For information on a competing processor architecture, the MIPS R10000, see [Yeag96].
[Yeag96] Yeager, K.C., “The MIPS R10000 Superscalar Microprocessor,” IEEE Micro, Vol. 16, No. 2, pp. 28-40, Apr. 1996.