## Instructor's Manual Vol. 2: Presentation Material

Pleroun Series in Compuler Science

# Introduction to Parallel Processing 

Algorithms and Architectures


Behrooz Parhami

This instructor's manual is for
Introduction to Parallel Processing: Algorithms and Architectures, by Behrooz Parhami, Plenum Series in Computer Science (ISBN 0-306-45970-1, QA76.58.P3798) © 1999 Plenum Press, New York (http://www.plenum.com)

All rights reserved for the author. No part of this instructor's manual may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording, or otherwise, without written permission. Contact the author at: ECE Dept., Univ. of California, Santa Barbara, CA 93106-9560, USA (parhami@ece.ucsb.edu)

## Preface to the Instructor's Manual

This instructor's manual consists of two volumes. Volume 1 presents solutions to selected problems and includes additional problems (many with solutions) that did not make the cut for inclusion in the text Introduction to Parallel Processing: Algorithms and Architectures (Plenum Press, 1999) or that were designed after the book went to print. It also contains corrections and additions to the text, as well as other teaching aids. The spring 2000 edition of Volume 1 consists of the following parts (the next edition is planned for spring 2001):

Vol. 1: Problem Solutions
Part I Selected Solutions and Additional Problems
Part II Question Bank, Assignments, and Projects
Part III Additions, Corrections, and Other Updates
Part IV Sample Course Outline, Calendar, and Forms
Volume 2 contains enlarged versions of the figures and tables in the text, in a format suitable for use as transparency masters. It is accessible, as a large postscript file, via the author's Web address: http://www.ece.ucsb.edu/faculty/parhami

Vol. 2: Presentation Material
Parts I-VI Lecture slides for Parts I-VI of the text
The author would appreciate the reporting of any error in the textbook or in this manual, suggestions for additional problems, alternate solutions to solved problems, solutions to other problems, and sharing of teaching experiences. Please e-mail your comments to
parhami@ece.ucsb.edu
or send them by regular mail to the author's postal address:
Department of Electrical and Computer Engineering University of California
Santa Barbara, CA 93106-9560, USA
Contributions will be acknowledged to the extent possible.
Behrooz Parhami
Santa Barbara, California, USA
April 2000

| Book |  | Book Parts | Half-Parts | Chapters |
| :---: | :---: | :---: | :---: | :---: |
|  |  | Part I: <br> Fundamental Concepts | Background and Motivation <br> Complexity and Models | 1. Introduction to Parallelism <br> 2. A Taste of Parallel Algorithms <br> 3. Parallel Algorithm Complexity <br> 4. Models of Parallel Processing |
|  |  | Part II: <br> Extreme <br> Models | Abstract View of Shared Memory <br> Circuit Model of Parallel Systems | 5. PRAM and Basic Algorithms <br> 6. More Shared-Memory Algorithms <br> 7. Sorting and Selection Networks <br> 8. Other Circuit-Level Examples |
|  |  | Part III: <br> Mesh-Based <br> Architectures | Data Movement on 2D Arrays <br> Mesh Algorithms and Variants | 9. Sorting on a 2D Mesh or Torus 10. Routing on a 2D Mesh or Torus <br> 11. Numerical 2D Mesh Algorithms <br> 12. Other Mesh-Related Architectues |
|  |  | Part IV: <br> Low-Diameter Architectures | The Hypercube Architecture <br> Hypercubic and Other Networks | 13. Hypercubes and Their Algorithms <br> 14. Sorting and Routing on Hypercubes <br> 15. Other Hypercubic Architectures <br> 16. A Sampler of Other Networks |
|  |  | Part V: <br> Some Broad Topics | Coordination and Data Access <br> Robustness and Ease of Use | 17. Emulation and Scheduling <br> 18. Data Storage, Input, and Output <br> 19. Reliable Parallel Processing <br> 20. System and Software Issues |
|  |  | Part VI: <br> Implementation Aspects | Control-Parallel Systems <br> Data Parallelism and Conclusion | 21. Shared-Memory MIMD Machines <br> 22. Message-Passing MIMD Machines <br> 23. Data-Parallel SIMD Machines <br> 24. Past, Present, and Future |

## The structure of this book in parts, half-parts, and chapters.

Table of Contents, Vol. 2
Preface to the Instructor's Manual ..... iii
Part I Fundamental Concepts ..... 1
1 Introduction to Parallelism ..... 2
2 A Taste of Parallel Algorithms ..... 9
3 Parallel Algorithm Complexity ..... 17
4 Models of Parallel Processing ..... 21
Part II Extreme Models ..... 28
5 PRAM and Basic Algorithms ..... 29
6 More Shared-Memory Algorithms ..... 35
7 Sorting and Selection Networks ..... 40
8 Other Circuit-Level Examples ..... 49
Part III Mesh-Based Architectures ..... 57
9 Sorting on a 2D Mesh or Torus 58
10 Routing on a 2-D Mesh or Torus ..... 69
11 Numerical 2-D Mesh Algorithms ..... 77
12 Other Mesh-Related Architectures ..... 88
Part IV Low-Diameter Architectures ..... 97
13 Hypercubes and Their Algorithms ..... 98
14 Sorting and Routing on Hypercubes ..... 106
15 Other Hypercubic Architectures ..... 113
16 A Sampler of Other Networks ..... 125
Part V Some Broad Topics ..... 133
17 Emulation and Scheduling ..... 134
18 Data Storage, Input, and Output ..... 140
19 Reliable Parallel Processing ..... 147
20 System and Software Issues ..... 154
Part VI Implementation Aspects ..... 158
21 Shared-Memory MIMD Machines ..... 159
22 Message-Passing MIMD Machines ..... 168
23 Data-Parallel SIMD Machines ..... 177
24 Past, Present, and Future ..... 188

## Part I Fundamental Concepts

## Part Goals

- Motivate us to study parallel processing
- Paint the big picture
- Provide background in the three Ts:

Terminology/Taxonomy
Tools - for evaluation or comparison
Theory - easy and hard problems

## Part Contents

- Chapter 1: Introduction to Parallelism
- Chapter 2: A Taste of Parallel Algorithms
- Chapter 3: Parallel Algorithm Complexity
- Chapter 4: Models of Parallel Processing


## 1 Introduction to Parallelism

Chapter Goals

- Set the context in which the course material will be presented
- Review challenges that face the designers and users of parallel computers
- Introduce metrics for evaluating the effectiveness of parallel systems

Chapter Contents

- 1.1. Why Parallel Processing?
- 1.2. A Motivating Example
- 1.3. Parallel Processing Ups and Downs
- 1.4. Types of Parallelism: A Taxonomy
- 1.5. Roadblocks to Parallel Processing
- 1.6. Effectiveness of Parallel Processing


### 1.1 Why Parallel Processing?



Fig. 1.1. The exponential growth of microprocessor performance, known as Moore's Law, shown over the past two decades.

Factors contributing to the validity of Moore's law
Denser circuits
Architectural improvements
Measures of processor performance
Instructions per second (MIPS, GIPS, TIPS, PIPS)
Floating-point operations per second
(MFLOPS, GFLOPS, TFLOPS, PFLOPS)
Running time on benchmark suites

There is a limit to the speed of a single processor (the speed-of-light argument)

Light travels $30 \mathrm{~cm} / \mathrm{ns}$; signals on wires travel at a fraction of this speed If signals must travel 1 cm in an instruction cycle, 30 GIPS is the best we can hope for


Fig. 1.2. The exponential growth in supercomputer per-formance over the past two decades [Bell92].

## The need for TFLOPS

Modeling of heat transport to the South Pole in the southern oceans [Ocean model: 4096 E-W regions $\times 1024 \mathrm{~N}-\mathrm{S}$ regions $\times 12$ layers in depth]

30000000000 FLOP per 10-min iteration $\times$ 300000 iterations per six-year period = $10^{16}$ FLOP

Fluid dynamics
$1000 \times 1000 \times 1000$ lattice $\times$
1000 FLOP per lattice point $\times 10000$ time steps $=$ $10^{16}$ FLOP

Monte Carlo simulation of nuclear reactor
100000000000 particles to track (for $\approx 1000$ escapes) $\times$ 10000 FLOP per particle tracked =
$10^{15}$ FLOP
Reasonable running time $=$
Fraction of hour to several hours $\left(10^{3}-10^{4} \mathrm{~s}\right)$
Computational power =
$10^{16} \mathrm{FLOP} / 10^{4} \mathrm{~s}$ or $10^{15} \mathrm{FLOP} / 10^{3} \mathrm{~s}=10^{12} \mathrm{FLOPS}$

Why the current quest for PFLOPS?
Same problems, perhaps with finer grids or longer simulated times

## ASCI: Advanced Strategic Computing Initiative, US Department of Energy



Fig. 24.1. Performance goals of the ASCI program.

## Status of Computing Power (circa 2000)

GFLOPS on desktop
Apple Macintosh, with G4 processor

## TFLOPS in supercomputer center

1152-processor IBM RS/6000 SP
uses a switch-based interconnection network
see IEEE Concurrency, Jan.-Mar. 2000, p. 9
Cray T3E, torus-connected

PFLOPS on drawing board
1M-processor IBM Blue Gene (2005?)
see IEEE Concurrency, Jan.-Mar. 2000, pp. 5-9
32 proc's/chip, 64 chips/board, 8 boards/tower, 64 towers
Processor: 8 threads, on-chip memory, no data cache
Chip: defect-tolerant, row/column rings in a $6 \times 6$ array
Board: $8 \times 8$ chip grid organized as $4 \times 4 \times 4$ cube
Tower: Each board linked to 4 neighbors in adjacent towers
System: $32 \times 32 \times 32$ cube of chips, 1.5 MW (water-cooled)

### 1.2 A Motivating Example

Sieve of Eratosthenes ('er-a-'taas-tha-neez) for finding all primes in [1, n]


Fig. 1.3. The sieve of Eratosthenes yielding a list of 10 primes for $n=30$. Marked elements have been distinguished by erasure from the list.


Fig. 1.4. Schematic representation of singleprocessor solution for the sieve of Eratosthenes.


Fig. 1.5. Schematic representation of a controlparallel solution for the sieve of Eratosthenes.


Fig. 1.6. Control-parallel realization of the sieve of Eratosthenes with $\mathbf{n}=1000$ and $1 \leq p \leq 3$.

## $P_{1}$ finds each prime and broadcasts it to all other processors (assume $n / p \leq \sqrt{n}$ )



Fig. 1.7. Data-parallel realization of the sieve of Eratosthenes.

## Some reasons for sublinear speed-up <br> Communication overhead




Fig. 1.8. Trade-off between communication time and computation time in the data-parallel realization of the sieve of Eratosthenes.

## Input/output overhead




Fig. 1.9. Effect of a constant $\mathrm{I} / \mathrm{O}$ time on the dataparallel realization of the sieve of Eratosthenes.

### 1.3 Parallel Processing Ups and Downs

Early 1900s: 1000s of "computers" (humans + calculators)


Fig. 1.10. Richardson's circular theater for weather forecasting calculations.

Parallel processing is used in virtually all computers
Compute-I/O overlap, pipelining, multiple function units But ... in this course we use "parallel processing" in a stricter sense implying the availability of multiple CPUs.
1960s: ILLIAC IV (U Illinois) - 4 quadrants, each $8 \times 8$ mesh 1980s: Commercial interest resurfaced; technology was driven by governement contracts. Once funding dried up, many companies went bankrupt.
2000s: The Internet revolution - info providers, multimedia, data mining, etc. need extensive computational power

### 1.4 Types of Parallelism: A Taxonomy



Fig. 1.11. The Flynn-Johnson classification of computer systems.

Why are computer architects so fascinated by four-letter acronyms and abbreviations?

RISC, CISC, PRAM, NUMA, VLIW
JPDC, TPDS
ICPP, IPPS, SPDP, SPAA
My contribution:
SINC: Scant/Simple Interaction Network Cell
FINC: Full Interaction Network Cell

### 1.5 Roadblocks to Parallel Processing

a. Grosch's law (economy of scale applies, or computing power is proportional to the square of cost)
b. Minsky's conjecture (speedup is proportional to the logarithm of the number p of processors)
c. Tyranny of IC technology (since hardware becomes about 10 times faster every 5 years, by the time a parallel machine with 10-fold performance is built, uniprocessors will be just as fast)
d. Tyranny of vector supercomputers (vector supercomputers are rapidly improving in performance and offer a familiar programming model and excellent vectorizing compilers; why bother with parallel processors?)
e. The software inertia (Billions of dollars worth of existing software makes it hard to switch to parallel systems)
f. Amdahl's law
a small fraction $f$ of inherently sequential or unparallelizable computation severely limits the speed-up)



Fig. 1.12. The limit on speed-up according to Amdahl's law.

### 1.6 Effectiveness of Parallel Processing



Fig. 1.13. Task graph exhibiting limited inherent parallelism.

Measures used in this course to compare parallel architectures and algorithms [Lee80]:
p Number of processors
W(p) Total number of unit operations performed by the $p$ processors; computational work or energy
$\mathrm{T}(\mathrm{p}) \quad$ Execution time with p processors;

$$
T(1)=W(1) \text { and } T(p) \leq W(p)
$$

$\mathrm{S}(\mathrm{p})$ Speedup $=\frac{\mathrm{T}(1)}{\mathrm{T}(\mathrm{p})}$
$\mathrm{E}(\mathrm{p})$ Efficiency $=\frac{\mathrm{T}(1)}{\mathrm{pT}(\mathrm{p})}$
$\mathrm{R}(\mathrm{p}) \quad$ Redundancy $=\frac{\mathrm{W}(\mathrm{p})}{\mathrm{W}(1)}$
$\mathrm{U}(\mathrm{p})$ Utilization $=\frac{\mathrm{W}(\mathrm{p})}{\mathrm{pT}(\mathrm{p})}$
$Q(\mathrm{p})$ Quality $\quad=\frac{\mathrm{T}^{3}(1)}{\mathrm{p}^{2}(\mathrm{p}) \mathrm{W}(\mathrm{p})}$
Relationships among the preceding measures:

$$
\begin{aligned}
& 1 \leq S(p) \leq p \\
& E(p)=\frac{S(p)}{p} \\
& \frac{1}{p} \leq E(p) \leq U(p) \leq 1 \\
& Q(p) \leq S(p) \leq p
\end{aligned}
$$

Example: Adding 16 numbers, assuming unit-time additions and ignoring all else, with $p=8$


Fig. 1.14. Computation graph for finding the sum of 16 numbers.

Zero-time communication: $\quad \mathrm{W}(8)=15 \quad \mathrm{~T}(8)=4$
$E(8)=15 /(8 \times 4)=47 \%$
$S(8)=15 / 4=3.75 \quad R(8)=15 / 15=1 \quad Q(8)=1.76$
Unit-time communication: $\quad \mathrm{W}(8)=22 \quad \mathrm{~T}(8)=7$
$E(8)=15 /(8 \times 7)=27 \%$
$S(8)=15 / 7=2.14 \quad R(8)=22 / 15=1.47 \quad Q(8)=0.39$

## 2 A Taste of Parallel Algorithms

Chapter Goals

- Consider five basic building-block parallel operations
- Implement them on four simple parallel architectures
- Learn about the nature of parallel computations, complexity analysis, and the interplay between algorithm and architecture

Chapter Contents

- 2.1. Some Simple Computations
- 2.2. Some Simple Architectures
- 2.3. Algorithms for a Linear Array
- 2.4. Algorithms for a Binary Tree
- 2.5. Algorithms for a 2D Mesh
- 2.6. Algorithms with Shared Variables


### 2.1 Some Simple Computations



Fig. 2.1. Semigroup computation on a uniprocessor.


Semigroup computation viewed as a tree or fan-in computation.


Prefix computation on a uniprocessor.
3. Packet routing one processor sending a packet of data to another
4. Broadcasting
one processor sending a packet of data to all others
5. Sorting
processors cooperating in rearranging their data into desired order

### 2.2 Some Simple Architectures



Fig. 2.2. A linear array of nine processors and its ring variant.

Diameter of linear array: $D=p-1$
(Max) Node degree: $\mathrm{d}=2$


Fig. 2.3. A balanced (but incomplete) binary tree of nine processors.

Diameter of balanced binary tree: $\mathrm{D}=2\left\lfloor\log _{2} \mathrm{p}\right\rfloor$; or one less (Max) Node degree: $d=3$
We almost always deal with complete binary trees:
$p$ one less than a power of $2 \quad D=2 \log _{2}(p+1)-2$


Fig. 2.4. A 2D mesh of nine processors and its torus variant.

Diameter of $r \times(p / r)$ mesh: $D=r+p / r-2$
(Max) Node degree: $d=4$
Square meshes preferred; they minimize D $(=2 \sqrt{p}-2)$


Fig. 2.5. A shared-variable architecture modeled as a complete graph.

Diameter of complete graph: $\mathrm{D}=1$
(Max) Node degree: $d=p-1$

### 2.3 Algorithms for a Linear Array



Fig. 2.6. Maximum-finding on a linear array of nine processors.


Fig. 2.7. Computing prefix sums on a linear array of nine processors.

Diminished prefix computation: the ith result excludes the ith element (e.g., sum of the first $\mathrm{i}-1$ elements)


Fig. 2.8. Computing prefix sums on a linear array with two items per processor.

## Packet routing or broadcasting: <br> right- and left-moving packets have no conflict



Fig. 2.9. Sorting on a linear array with the keys input sequentially from the left.


Fig. 2.10. Odd-even transposition sort on a linear array.

## For odd-even transposition sort:

Speed-up $=O(p \log p) / p=O(\log p)$
Efficinecy $=O((\log p) / p)$
Redundancy $=O(p /(\log p))$
Utilization $=1 / 2$

### 2.4 Algorithms for a Binary Tree



Upward
Propagation


Fig. 2.11. Parallel prefix computation on a binary tree of processors.

## Some applications of the parallel prefix computation

Finding the rank of each 1 in a list of 0 s and 1 s :

| Data | $:$ | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 |
| :---: | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| sums | $:$ | 0 | 0 | 1 | 1 | 2 | 2 | 2 | 3 | 4 | 5 | 5 |
| of 1 s | $:$ |  |  | 1 |  | 2 |  |  | 3 | 4 | 5 |  |

Priority circuit:

| Data | $:$ | $\mathbf{0}$ | $\mathbf{0}$ | $\mathbf{1}$ | $\mathbf{0}$ | $\mathbf{1}$ | $\mathbf{0}$ | $\mathbf{0}$ | $\mathbf{1}$ | $\mathbf{1}$ | $\mathbf{1}$ | $\mathbf{0}$ |
| ---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Diminished prefix ORs : | $\mathbf{0}$ | $\mathbf{0}$ | $\mathbf{0}$ | $\mathbf{1}$ | $\mathbf{1}$ | $\mathbf{1}$ | $\mathbf{1}$ | $\mathbf{1}$ | $\mathbf{1}$ | $\mathbf{1}$ | $\mathbf{1}$ |  |
| Complement | $:$ | $\mathbf{1}$ | $\mathbf{1}$ | $\mathbf{1}$ | $\mathbf{0}$ | $\mathbf{0}$ | $\mathbf{0}$ | $\mathbf{0}$ | $\mathbf{0}$ | $\mathbf{0}$ | $\mathbf{0}$ | $\mathbf{0}$ |
| AND with data | $:$ | $\mathbf{0}$ | $\mathbf{0}$ | $\mathbf{1}$ | $\mathbf{0}$ | $\mathbf{0}$ | $\mathbf{0}$ | $\mathbf{0}$ | $\mathbf{0}$ | $\mathbf{0}$ | $\mathbf{0}$ | $\mathbf{0}$ |

Carry computation in fast adders
Let " g ", " p ", and " a " denote the event that a particular digit position in the adder generates, propagates, or annihilates a carry. The input data for the carry circuit consists of a vector of three-valued elements such as:


Parallel prefix computation using the carry operator " $\varnothing$ "
$\mathrm{pc} \mathbf{x}=\mathrm{x} \quad \mathrm{x}$ propagates over p, for all $\mathrm{x} \in\{\mathrm{g}, \mathrm{p}, \mathrm{a}\}$
$a c x=a \quad x$ is annihilated or absorbed by a
$g \mathrm{c} x=\mathrm{g} \quad \mathrm{x}$ is immaterial; a carry is generated

## Packet routing on a tree



A balanced binary tree with preorder node indices.
maxl (maxr) = largest node number in the left (right) subtree
if $\quad$ dest $=$ self
then remove the packet \{done\}
else if dest < self or dest > maxr
then route upward
else if dest $\leq \max$
then route leftward
else route rightward endif
endif
endif

## Other indexing schemes might lead to simpler routing algorithms



Broadcasting is done via the root node

Sorting: let the root "see" all data in nondescending order


Fig. 2.12. The first few steps of the sorting algorithm on a binary tree.


Fig. 2.13. The bisection width of a binary tree architecture.

### 2.5 Algorithms for a 2D Mesh



Finding the max value on a 2D mesh


Row prefix sums


Diminished prefix sums in last column


Broadcast in rows and combine

Computing prefix sums on a 2D mesh
Row-major order required if the operator is not commutative Routing and broadcasting done via row/column operations


Initial values




Snake-like


Phase 2


Top-to-bottom


Left-to-right


Fig. 2.14. The shearsort algorithm on a $3 \times 3$ mesh.

### 2.6 Algorithms with Shared Variables



Fig. 2.5. A shared-variable architecture modeled as a complete graph.

Semigroup computation: each processor read all values in turn and combine
Parallel prefix: processor i read/combine values 0 to $\mathrm{i}-1$
Both of the above are quite inefficient, given the high cost
Packet routing and broadcasting: one step, assuming allport communication
Sorting: rank each element by comparing it to all others, then permute according to ranks



Figure for Problem 2.13.

## 3 Parallel Algorithm Complexity

Chapter Goals

- Review algorithm complexity and various complexity classes
- Introduce the notions of time and time-cost optimality
- Derive tools for analyzing, comparing, and fine-tuning parallel algorithms


## Chapter Contents

- 3.1. Asymptotic Complexity
- 3.2. Algorithm Optimality and Efficiency
- 3.3. Complexity Classes
3.4. Parallelizable Tasks and the NC Class
- 3.5. Parallel Programming Paradigms
- 3.6. Solving Recurrences


### 3.1 Asymptotic Complexity

$f(n)=O(g(n))$ if $\exists c, n_{0}$ such that $\forall n>n_{0}, f(n)<c g(n)$
$\mathrm{f}(\mathrm{n})=\Omega(\mathrm{g}(\mathrm{n}))$ if $\exists \mathrm{c}, \mathrm{n}_{0}$ such that $\forall \mathrm{n}>\mathrm{n}_{0}, \mathrm{f}(\mathrm{n})>\mathrm{c} \mathrm{g}(\mathrm{n})$
$f(n)=\Theta(g(n))$ if $\exists c, c^{\prime}, n_{0}$ such that

$$
\forall n>n_{0}, c g(n)<f(n)<c^{\prime} g(n)
$$


$\mathrm{f}(\mathrm{n})=\mathrm{O}(\mathrm{g}(\mathrm{n}))$



Fig. 3.1. Graphical representation of the notions of asymptotic complexity.

$$
\begin{array}{rlll}
\mathrm{f}(\mathrm{n})=\mathrm{o}(\mathrm{~g}(\mathrm{n})) & < & \text { Growth rate strictly less than } \\
\mathrm{f}(\mathrm{n}) & =\mathrm{O}(\mathrm{~g}(\mathrm{n})) & \leq & \text { Growth rate no greater than } \\
\mathrm{f}(\mathrm{n})=\Theta(\mathrm{g}(\mathrm{n})) & = & \text { Growth rate the same as } \\
\mathrm{f}(\mathrm{n})=\Omega(\mathrm{g}(\mathrm{n})) & \geq & \text { Growth rate no less than } \\
\mathrm{f}(\mathrm{n})=\omega(\mathrm{g}(\mathrm{n})) & > & \text { Growth rate strictly greater than }
\end{array}
$$

Table 3.1. Comparing the Growth Rates of Sublinear and Superlinear Functions $(K=1000$, $M=1000000)$

| Sublinear |  | Linear <br> n | Superlinear |  |
| :---: | :---: | :---: | :---: | :---: |
| $\log ^{2} \mathrm{n}$ | $\sqrt{n}$ |  | $n \log ^{2} \mathrm{n}$ | $\mathrm{n}^{3 / 2}$ |
| 9 | 3 | 10 | 90 | 30 |
| 36 | 10 | 100 | 3.6 K | 1 K |
| 81 | 31 | 1 K | 81 K | 31 K |
| 9 | 100 | 100K | 26 M | 32M |
| 361 | ${ }^{1} \mathrm{~K}$ | 1 M | 361 M | 1000M |

Table 3.2. Effect of Constants on the Growth Rates of Selected Functions Involving Constant Factors ( $K=1000, M=1000000$ )

| n | $\frac{n}{4} \log ^{2} \mathrm{n}$ | $n \log ^{2} \mathrm{n}$ | $100 \sqrt{n}$ | $\mathrm{n}^{3 / 2}$ |
| :---: | :---: | :---: | :---: | :---: |
| 10 | 22 | 90 | 300 | 30 |
| 100 | 900 | ${ }^{31} \mathrm{~K}$ K | 1 K | 1 K |
| 10K | 423K | 1.7 M | 10 K | 1 N |
| 100K | 6 M | 26 M | 32 K | 2M |

Table 3.3. Effect of Constants on the Growth Rates of Selected Functions Using Larger Time Units and Round Figures

| n | $\frac{n}{4} \log ^{2} \mathrm{n}$ | $n \log ^{2} \mathrm{n}$ | $100 \sqrt{n}$ | $\mathrm{n}^{3 / 2}$ |
| :---: | :---: | :---: | :---: | :---: |
| 10 | 20 s | 2 min | 5 min | 30 s |
| 100 | 15 min | 1 hr | 15 min | 15 min |
| 1K | 6 hr | 1 day | 1 hr | 9 hr |
| 100K | 5 days | 10 days | 3 hr 1 yr r | 10 days |
| 1 M | 3 yr | 11 yr | 3 yr | 32 yr |

### 3.2 Algorithm Optimality and Efficiency

$f(n) \quad$ Running time of fastest (possibly unknown) algorithm for solving a problem
$g(n) \quad$ Running time of some algorithm $A \Rightarrow f(n)=O(g(n))$
$h(n) \quad$ Min time for solving the problem $\Rightarrow f(n)=\Omega(h(n))$
$g(n)=h(n) \quad \Rightarrow \quad$ Algorithm $A$ is time-optimal
Redundancy $=$ Utilization $=1 \quad \Rightarrow$ A is cost-time optimal
Redundancy $=$ Utilization $=\Theta(1) \Rightarrow A$ is cost-time efficient


Fig. 3.2. Upper \& lower bounds may tighten over time.


Fig. 3.3. Five times fewer steps does not necessarily mean five times faster.

### 3.3 Complexity Classes



Conceptual view of complexity classes P, NP, NP-complete, and NP-hard.

## Example NP(-complete) problem: the subset sum problem

Given a set of $n$ integers and a target sum s, determine if a subset of the integers in the set add up to s.
This problem looks deceptively simple, yet no one knows how to solve it other than by trying practically all of the $2^{n}$ subsets of the given set.

## Even if each of these trials takes only one picosecond, the problem is virtually unsolvable for $n=100$.

### 3.4 Parallelizable Tasks and the NC Class



Fig. 3.4. A conceptual view of complexity classes and their relationships.

NC (Nick's class, Niclaus Pippenger)
Problems solvable in polylogrithmic time ( $T=O\left(\log ^{k} n\right)$ ) using a polynomially bounded number of processors
Example P-complete problem: the circuit-value problem
Given a logic circuit with known inputs, determine its output.
The circuit-value problem is obvioudly in P , but no general algorithm exists for efficient parallel evaluation of a circuit's output.

### 3.5 Parallel Programming Paradigms

## Divide and conquer

Decompose problem of size n into smaller problems Solve the subproblems independently
Combine subproblem results into final answer
$T(n)=T_{d}(n)+T_{s}+T_{c}(n)$
Decompose Solve in parallel Combine

## Randomization

Often it is impossible or difficult to decompose a large problem into subproblems with equal solution times.
In these cases, one might use random decisions that lead to good results with very high probability.
Example: sorting with random sampling
Other forms of randomization:
Random search
Control randomization
Symmetry breaking

## Approximation

Iterative numerical methods often use approximation to arrive at the solution(s).
Example: Solving linear systems using Jacobi relaxation.
Under proper conditions, the iterations converge to the correct solutions; more iterations $\Rightarrow$ more accurate results

### 3.6 Solving Recurrences

## Solution via unrolling

1. $\quad f(n)=f(n-1)+n \quad\{$ Rewrite $f(n-1)$ as $f((n-1)-1)+n-1\}$
$=f(n-2)+n-1+n$
$=\mathrm{f}(\mathrm{n}-3)+\mathrm{n}-2+\mathrm{n}-1+\mathrm{n}$
...
$=f(1)+2+3+\cdots+n-1+n$
$=n(n+1) / 2-1$
$=\Theta\left(n^{2}\right)$
2. $f(n)=f(n / 2)+1 \quad\{$ Rewrite $f(n / 2)$ as $f((n / 2) / 2+1\}$
$=f(n / 4)+1+1$
$=f(n / 8)+1+1+1$
$=f(n / n)+1+1+1+\cdots+1$
$=\log _{2} n$
$=\Theta(\log n)$
3. $f(n)=2 f(n / 2)+1$
$=4 f(n / 4)+2+1$
$=8 f(n / 8)+4+2+1$
$=n f(n / n)+n / 2+\cdots+4+2+1$
$=\mathrm{n}-1$
$=\Theta(n)$
4. $f(n)=f(n / 2)+n$
$=f(n / 4)+n / 2+n$
$=f(n / 8)+n / 4+n / 2+n$

$$
\begin{aligned}
& =f(n / n)+2+4+\cdots+n / 4+n / 2+n \\
& =2 n-2=\Theta(n)
\end{aligned}
$$

5. $f(n)=2 f(n / 2)+n$

$$
=4 f(n / 4)+n+n
$$

$$
=8 f(n / 8)+n+n+n
$$

$$
\ldots
$$

$$
=n f(n / n)+n+n+n+\cdots+n
$$

$$
----\log _{2 n} n \text { times ------ }
$$

$$
=n \log _{2} n=\Theta(n \log n)
$$

Alternate solution for the recurrence $f(n)=2 f(n / 2)+n$ :
Rewrite the recurrence as $\frac{f(n)}{n}=\frac{f(n / 2)}{n / 2}+1$ and denote $f(n) / n$ by $h(n)$ to convert the problem to Example 2
6. $f(n)=f(n / 2)+\log _{2} n$

$$
=f(n / 4)+\log _{2}(n / 2)+\log _{2} n
$$

$$
=f(n / 8)+\log _{2}(n / 4)+\log _{2}(n / 2)+\log _{2} n
$$

$$
=f(n / n)+\log _{2} 2+\log _{2} 4+\cdots+\log _{2}(n / 2)+\log _{2} n
$$

$$
=1+2+3+\cdots+\log _{2} n
$$

$$
=\log _{2} n\left(\log _{2} n+1\right) / 2=\Theta\left(\log ^{2} n\right)
$$

## Solution via guessing

Guess the solution and verify it by substitution
Substitution also useful to find the constant multiplicative factors and lower-order terms
Example: $f(n)=f(n-1)+n$; guess $f(n)=\Theta\left(n^{2}\right)$
Write $f(n)=a n^{2}+g(n)$, where $g(n)=o\left(n^{2}\right)$
Substituting in the recurrence equation, we get:

$$
a n^{2}+g(n)=a(n-1)^{2}+g(n-1)+n
$$

This equation simplifies to:

$$
g(n)=g(n-1)+(1-2 a) n+a
$$

Choose $a=1 / 2$ to make $g(n)=0\left(n^{2}\right)$ possible

$$
g(n)=g(n-1)+1 / 2=n / 2-1 \quad\{g(1)=0\}
$$

The solution to the original recurrence then becomes

$$
f(n)=n^{2} / 2+n / 2-1
$$

## Solution via a basic theorem

Theorem 3.1 (basic theorem for recurrences): Given
$f(n)=a f(n / b)+h(n) ; a, b$ constant, $h$ an arbitrary function the asymptotic solution to the recurrence is
$\mathrm{f}(\mathrm{n})=\Theta\left(\mathrm{n}^{\log _{\mathrm{b}} \mathrm{a}}\right)$
if $h(n)=O\left(n^{\log _{b} a-\varepsilon}\right)$ for some $\varepsilon>0$
$f(n)=\Theta\left(n^{\log _{b} a} \log n\right)$
if $h(n)=\Theta\left(n^{\log _{b} a}\right)$
$\mathrm{f}(\mathrm{n})=\Theta(\mathrm{h}(\mathrm{n}))$
if $\mathrm{h}(\mathrm{n})=\Omega\left(\mathrm{n}^{\log _{b} a+\varepsilon}\right)$ for some $\varepsilon>0$

## 4 Models of Parallel Processing

Chapter Goals

- Elaborate on the taxonomy of parallel processing from Chapter 1
- Introduce abstract models of shared and distributed memory
- Understand the differences between abstract models and real hardware


## Chapter Contents

- 4.1. Development of Early Models
- 4.2. SIMD versus MIMD Architectures
- 4.3. Global versus Distributed Memory
- 4.4. The PRAM Shared-Memory Model
- 4.5. Distributed-Memory or Graph Models
- 4.6. Circuit Model \& Physical Realizations
4.1 Development of Early Models
Thousands of processors were found in some computersas early as the 1960sThese architectures were variously referred to asassociative memoriesassociative processorslogic-in-memory machines
More recent names areprocessor-in-memory andintelligent RAM
Table 4.1. Entering the Second Half-Century of Associative Processing

| Decade | Events and Advances | Technology | Performance |
| :--- | :--- | :--- | :--- |
| 1940s | Formulation of need \& concept | Relays |  |
| 1950s | Emergence of cell technologies | Magnetic, Cryogenic | Mega-bit-OPS |
| 1960s | Introduction of basic architectures | Transistors |  |
| 1970s | Commercialization \& applications | ICs | Giga-bit-OPS |
| 1980s | Focus on system/software issues | VLSI | Tera-bit-OPS |
| 1990s | Scalable \& flexible architectures | ULSI, WSI | Peta-bit-OPS? |

## Revisiting the Flynn-Johnson classification



Fig. 4.1. The Flynn-Johnson classification of computer systems.

## MISD can be viewed as a flexible (programmable) pipeline



Fig. 4.2. Multiple instruction streams operating on a single data stream (MISD).

### 4.2 SIMD versus MIMD Architectures

Most early parallel machines were of SIMD type Synchronous SIMD

To perform data-dependent conditionals (if-then-else), first processors satisfying the condition are enabled, next the remainder are enabled for the "else" part Critics of SIMD view the above as being wasteful

But: are buses less efficient than private cars, or is your PC hardware wasted when you answer the phone? Asynchronous SIMD = SPMD
Custom- versus commodity-chip SIMD

Most recent parallel machines are MIMD-type
MPP: massively or moderately parallel processor?
Tight versus loose coupling of processors
Tightly coupled: multiprocessors
Loosely coupled: multicomputers
Network or cluster of workstations (NOW, COW) Hybrid: loosely coupled clusters, each tightly coupled
Message passing versus virtual shared memory
Shared memory is easier to program
Message passing is more efficient

### 4.3 Global versus Distributed Memory



Fig. 4.3. A parallel processor with global memory.
Example processor-to-memory/processor networks:

1. Crossbar; $p \times m$ array of switches or crosspoints; cost too high for massively parallel systems
2. Single/multiple bus (complete or partial connectivity)
3. Multistage interconnection network (MIN); cheaper than crossbar, more bandwidth than bus


Parallel I/O
Fig. 4.4. A parallel processor with global memory and processor caches.

## Solving the cache coherence problem

1. Do not cache any shared data
2. Do not cache "writeable" shared data or allow only one cache copy
3. Use a cache coherence protocol (Chapter 18)


Fig. 4.5. A parallel processor with distributed memory.

Examples networks for distributed memory machines

1. Crossbar; cost too high for massively parallel systems
2. Single/multiple bus (complete or partial connectivity)
3. Multistage interconnection network (MIN)
4. Various direct networks (Section 4.5)

Terminology
UMA Uniform memory access
NUMA Nonuniform memory access
COMA Cache-only memory architecture (aka all-cache)

### 4.4 The PRAM Shared-Memory Model



Fig. 4.6. Conceptual view of a parallel randomaccess machine (PRAM).

## PRAM cycle

1. Processors access memory (usually different locations)
2. Processors perform a computation step
3. Processors store their results in memory


Fig. 4.7. PRAM with some hardware details shown.
In practice, memory is divided into modules and simultaneous accesses to the same module are disallowed

### 4.5 Distributed-Memory or Graph Models

Parameters of interest for direct interconnection networks Diameter
Bisection (band)width Node degree
Symmetry properties simplify algorithm development: Node or vertex symmetry Link or edge symmetry

Table 4.2. Topological Parameters of Selected Interconnection Networks

| Network name(s) | Number of nodes | Network diameter | Bisection width | Node degree | Local links? |
| :---: | :---: | :---: | :---: | :---: | :---: |
| 1D mesh (linear array) | k | k-1 | 1 | 2 | Yes |
| 1D torus (ring, loop) | k | k/2 | 2 | 2 | Yes |
| 2D Mesh | $\mathrm{k}^{2}$ | $2 \mathrm{k}-2$ | k | 4 | Yes |
| 2D torus (k-ary 2-cube) | $\mathrm{k}^{2}$ | k | 2 k | 4 | Yes ${ }^{1}$ |
| 3D mesh | $\mathrm{k}^{3}$ | 3k-3 | $\mathrm{k}^{2}$ | 6 | Yes |
| 3D torus (k-ary 3-cube) | $k^{3}$ | 3k/2 | $2 \mathrm{k}^{2}$ | 6 | Yes ${ }^{1}$ |
| Pyramid | $\left(4 k^{2}-1\right) / 3$ | $2 \log _{2} k$ | 2k | 9 | No |
| Binary tree | 2-1 | 21-2 | 1 | 3 | No |
| 4-ary hypertree | $2^{1}\left(2^{1+1}-1\right)$ | 21 | $2^{1+1}$ | 6 | No |
| Butterfly | $2^{\prime}(1+1)$ | 21 | 2 | 4 | No |
| Hypercube | 21 | 1 | $2^{1-1}$ | 1 | No |
| Cube-connected cycles | 21 | 2 | $2^{1-1}$ | 3 | No |
| Shuffle-exchange | 2 | 21-1 | $\geq 2^{-1 / / I}$ | 4 unidir. | No |
| De Bruijn | 21 | I | 2 // | 4 unidir. | No |

${ }^{1}$ With folded layout.


Fig. 4.8. The sea of interconnection networks.

## Bus-based architectures are dominant in small-scale parallel systems.



Fig. 4.9. Example of a hierarchical interconnection architecture.

Because each interconnection network requires its own algorithms, various abstract (architecture-independent) models have been suggested for such networks

## The LogP model

Characterizes an architecture with just four parameters:
L Latency upper bound when a small message is sent from an arbitrary source to an arbitrary destination
o overhead, defined as the length of time a processor is dedicated to transmission or reception of a message, thus being unable to do any other computation
g gap, defined as the minimum time that must elapse between consecutive message transmissions or receptions by a single processor ( $1 / \mathrm{g}$ is the available per-processor communication bandwidth)

P Processor multiplicity (p in our notation)
If LogP is in fact an accurate model for capturing the effects of communication in parallel processors, then the details of interconnection network do not matter

## The BSP model (bulk-synchronous parallel)

Hides the communication latency altogether through a specific parallel programming style, thus making the network topology irrelevant
Synchronization of processors occurs once every L time steps, where $L$ is a periodicity parameter
Computation consists of a sequence of supersteps
In a given superstep, each processor performs a task consisting of local computation steps, message transmissions, and message receptions
Data received in messages will not be used in the current superstep but rather beginning with the next superstep
After each period of $L$ time units, a global check is made to see if the current superstep has been completed

If so, then the processors move on to executing the next superstep
Else, the next period of length $L$ is allocated to the unfinished super-step

### 4.6 Circuit Model and Physical Realizations



Fig. 4.10. Intrachip wire delay as a function of wire length.


Scaled up ant collapses under own weight.
Fig. 4.11. Pitfalls of scaling up.

## Part II Extreme Models

## Part Goals

- Study two extreme parallel machine models
- Abstract PRAM shared-memory model ignores implementation issues altogether
- Concrete circuit model accommodates details like circuit depth and layout area
- Prepare for everthing else that falls in between the two extremes


## Part Contents

- Chapter 5: PRAM and Basic Algorithms
- Chapter 6: More Shared-Memory Algorithms
- Chapter 7: Sorting and Selection Networks
- Chapter 8: Other Circuit-Level Examples


## 5 PRAM and Basic Algorithms

Chapter Goals

- Define PRAM and its various submodels
- Show PRAM to be a natural extension of the sequential computer (RAM)
- Develop five important parallel algorithms that can serve as building blocks (more algorithms in the next chapter)

Chapter Contents

- 5.1. PRAM Submodels and Assumptions
- 5.2. Data Broadcasting
- 5.3. Semigroup or Fan-in Computation
- 5.4. Parallel Prefix Computation
- 5.5. Ranking the Elements of a Linked List
- 5.6. Matrix Multiplication


### 5.1 PRAM Submodels and Assumptions



Fig. 4.6. Conceptual view of a parallel randomaccess machine (PRAM).

Processor i can do the following in 3 phases of one cycle: 1. Fetch an operand from address $\mathrm{s}_{\mathrm{i}}$ in shared memory 2. Perform computations on data held in local registers
3. Store a value into address $d_{i}$ in shared memory


Fig. 5.1 Submodels of the PRAM model.

CRCW PRAM is classified according to how concurrent writes are handled. These submodels are all different from each other and from EREW and CREW.
Undefined: In case of multiple writes, the value written is undefined (CRCW-U)
Detecting: A code representing "detected collision" is written (CRCW-D)
Common: Multiple writes allowed only if all store the same value (CRCW-C); this is sometimes called the consistent-write submodel
Random: The value written is randomly chosen from those offered (CRCW-R)
Priority: The processor with the lowest index succeeds in writing (CRCW-P)
Max/Min: The largest/smallest of the multiple values is written (CRCW-M)
Reduction: The arithmetic sum (CRCW-S), logical AND (CRCW-A), logical XOR (CRCW-X), or another combination of the multiple values is written.
One way to order these submodels is by their computational power:

```
EREW < CREW < CRCW-D
    < CRCW-C < CRCW-R < CRCW-P
```

Theorem 5.1: A p-processor CRCW-P (priority) PRAM can be simulated (emulated) by a p-processor EREW PRAM with a slowdown factor of $\Theta(\log p)$.

### 5.2 Data Broadcasting

## Broadcasting is built-in for the CREW and CRCW models <br> EREW broadcasting: make $p$ copies of the data in a broadcast vector B

Making p copies of $\mathrm{B}[0]$ by recursive doubling
for $k=0$ to $\left\lceil\log _{2} p\right\rceil-1$ Processor j, $0 \leq j<p$, do
Copy $B[j]$ into $B\left[j+2^{k}\right]$
endfor


Fig. 5.2. Data broadcasting in EREW PRAM via recursive doubling.


Fig. 5.3. EREW PRAM data broadcasting without redundant copying.

EREW PRAM algorithm for broadcasting by Processor i
Processor i write the data value into $\mathrm{B}[0]$
$\mathrm{s}:=1$
while $\mathrm{s}<\mathrm{p}$ Processor j, $0 \leq \mathrm{j}<\min (\mathrm{s}, \mathrm{p}-\mathrm{s})$, do
Copy B[j] into B[j + s]
$\mathrm{s}:=2 \mathrm{~s}$
endwhile
Processor $j, 0 \leq j<p$, read the data value in $B[j]$

EREW PRAM algorithm for all-to-all broadcasting
Processor j, $0 \leq j<p$, write own data value into $B[j]$
for $k=1$ to $p-1$ Processor $j, 0 \leq j<p$, do
Read the data value in $B[(j+k) \bmod p]$
endfor

Both of the preceding algorithms are time-optimal (shared memory is the only communication mechanism and each processor can read but one value per cycle)

In the following naive sorting algorithm, processor j determines the rank $R[j]$ of its data element $S[j]$ by examining all the other data elements; it then writes S[j] in element R[j] of the output (sorted) vector

```
Naive EREW PRAM sorting algorithm
(using all-to-all broadcasting)
Processor j, \(0 \leq \mathrm{j}<\mathrm{p}\), write 0 into \(R[j]\)
for \(k=1\) to \(p-1\) Processor \(j, 0 \leq j<p\), do
    \(\mathrm{I}:=(\mathrm{j}+\mathrm{k}) \bmod \mathrm{p}\)
    if \(S[I]<S[j]\) or \(S[l]=S[j]\) and \(I<j\)
    then \(R[j]:=R[j]+1\)
    endif
endfor
Processor j, \(0 \leq j<p\), write \(\mathrm{S}[j]\) into \(\mathrm{S}[R[j]]\)
```

This $O(p)$-time algorithms is far from being optimal

### 5.3 Semigroup or Fan-in Computation

This computation is trivial for a CRCW PRAM of the reduction variety if the reduction operator happens to be $\otimes$


|  | $0: 0$ |
| :--- | ---: |
|  | $0: 1$ |
|  | $0: 2$ |
|  | $0: 3$ |
|  | $0: 4$ |
|  | $0: 5$ |
|  | $0: 6$ |
|  | $0: 7$ |
|  | $1: 8$ |
|  | $2: 9$ |


| $0: 0$ |
| ---: |
| $0: 1$ |
| $0: 2$ |
| $0: 3$ |
| $0: 4$ |
| $0: 5$ |
| $0: 6$ |
| $0: 7$ |
| $0: 8$ |
| $0: 9$ |

Fig. 5.4. Semigroup computation in EREW PRAM.
EREW PRAM semigroup computation algorithm
Processor j, $0 \leq \mathrm{j}<\mathrm{p}$, copy $\mathrm{X}[\mathrm{j}]$ into $\mathrm{S}[\mathrm{j}]$
$\mathrm{s}:=1$
while $\mathrm{s}<\mathrm{p}$ Processor j, $0 \leq \mathrm{j}<\mathrm{p}-\mathrm{s}$, do $S[j+s]:=S[j] \otimes S[j+s]$
$\mathrm{s}:=2 \mathrm{~s}$
endwhile
Broadcast S[p-1] to all processors

The preceding algorithm is time-optimal (CRCW can do better: problem 5.16)

Speed-up $=p / \log _{2} p$
Efficiency $=$ Speed-up/p $=1 / \log _{2} p$
Utilization $=\frac{W(p)}{p T(p)} \approx \frac{(p-1)+(p-2)+(p-4)+\ldots+(p-p / 2)}{p} 1-1 / \log _{2} p \quad 1$

Semigroup computation with each processor holding $n / p$ data elements:

Each processor combine its sublist n/p steps
Do semigroup computation on results $\log _{2} p$ steps
$\operatorname{Speedup}(n, p)=\frac{n}{n / p+2 \log _{2} \overline{\bar{p}}} \frac{p}{1+\left(2 p \log _{2} p\right) / n}$
Efficiency $(n, p)=$ Speedup $/ p=\frac{1}{1+\left(2 p \log _{2} p\right) / n}$

For $p=\Theta(n)$, a sublinear speedup of $\Theta(n / \log n)$ is obtained The efficiency in this case is $\Theta(n / \log n) / \Theta(n)=\Theta(1 / \log n)$

Limiting the number of processors to $p=O(n / \log n)$, yields:
Speedup $(\mathrm{n}, \mathrm{p})=\mathrm{n} / \mathrm{O}(\log \mathrm{n})=\Omega(\mathrm{n} / \log \mathrm{n})=\Omega(\mathrm{p})$
Efficiency $(\mathrm{n}, \mathrm{p})=\Theta(1)$
Using fewer processors than tasks = parallel slack


Fig. 5.5. Intuitive justification of why parallel slack helps improve the efficiency.

### 5.4 Parallel Prefix Computation



Fig. 5.6. Parallel prefix computation in EREW PRAM via recursive doubling.

## Two other solutions, based on divide and conquer



Fig. 5.7 Parallel prefix computation using a divide-and-conquer scheme.

$$
T(p)=T(p / 2)+2=2 \log _{2} p
$$



Fig. 5.8. Another divide-and-conquer scheme for parallel prefix computation.

$$
T(p)=T(p / 2)+1=\log _{2} p \quad \text { Requires commutativity }
$$

### 5.5 Ranking the Elements of a Linked List



Fig. 5.9. Example linked list and the ranks of its elements.


Fig. 5.10. PRAM data structures representing a linked list and the ranking results.

List-ranking appears to be hopelessly sequential
However, we can in fact use a recursive doubling scheme to determine the rank of each element in optimal time There exist other problems that appear to be unparallizable This is why intuition can be misleading when it comes to determining which computations are or are not efficiently parallelizable (i.e., whether a computation is or is not in NC)


Fig. 5.11. Element ranks initially and after each of the three iterations.

## PRAM list ranking algorithm (via pointer jumping)

Processor $\mathrm{j}, 0 \leq \mathrm{j}<\mathrm{p}$, do \{initialize the partial ranks\}
if next[j] = j
then rank[j]:=0
else rank[j]:=1
endif
while $\operatorname{rank}[$ next[head] $] \neq 0$ Processor j, $0 \leq j<p$, do
$\operatorname{rank[j]~:=~rank[j]~+~rank[next[j]]~}$
next[j] := next[next[j]]
endwhile

Which PRAM submodel is implicit in the preceding algorithm?

### 5.6 Matrix Multiplication

For $m \times m$ matrices, $C=A \times B$ means: $\quad c_{i j}=\sum_{k=0}^{m-1} a_{i k} b_{k j}$
Sequential matrix multiplication algorithm

$$
\begin{aligned}
& \text { for } \mathrm{i}=0 \text { to } m-1 \text { do } \\
& \begin{array}{l}
\text { for } \mathrm{j}=0 \text { to } m-1 \text { do } \\
\mathrm{t}:=0 \\
\text { for } k=0 \text { to } m-1 \text { do } \\
t:=\mathrm{t}+\mathrm{a}_{\mathrm{ik}} \mathrm{~b}_{\mathrm{kj}} \\
\text { endfor } \\
\mathrm{c}_{\mathrm{ij}}:=\mathrm{t} \\
\text { endfor }
\end{array} \\
& \text { endfor }
\end{aligned}
$$



Fig. 5.12. PRAM matrix multiplication by using $\mathbf{p}=\mathbf{m}^{2}$ processors.

PRAM matrix multiplication algorithm using $\mathrm{m}^{2}$ processors
Processor (i, j), $0 \leq i, j<m$, do
begin
$t:=0$
for $k=0$ to $m-1$ do
$t:=t+a_{i k} b_{k j}$
endfor
$\mathrm{Cij}_{\mathrm{ij}}:=\mathrm{t}$
end


PRAM matrix multiplication algorithm using m processors

$$
\begin{aligned}
& \text { for } j=0 \text { to } m-1 \text { Processor } i, 0 \leq i<m \text {, do } \\
& t:=0 \\
& \text { for } k=0 \text { to } m-1 \text { do } \\
& \quad t:=t+a_{i k} b_{k j} \\
& \text { endfor } \\
& c_{i j}:=t \\
& \text { endfor }
\end{aligned}
$$

## Both of the preceding algorithms are efficient and provide linear speedup

Using fewer than m processors: each processor computes $\mathrm{m} / \mathrm{p}$ rows of C


The preceding solution is not efficient for NUMA parallel architectures

Each element of $B$ is fetched $m / p$ times
For each such data access, only two arithmetic operations are performed

## Block matrix multiplication



Fig. 5.13. Partitioning the matrices for block matrix multiplication.


Each multiply-add computation on $q \times q$ blocks needs $2 q^{2}=2 m^{2} / p$ memory accesses to read the blocks $2 q^{3}$ arithmetic operations
So, q arithmetic operations are done per memory access We assume that processor (i, j) has local memory to hold Block ( $\mathrm{i}, \mathrm{j}$ ) of the result matrix C ( $\mathrm{q}^{2}$ elements) One block-row of B; say Row kq + c of Block ( $k, j$ ) of B
(Elements of A can be brought in one at a time)

For example, as element in row iq +a of column $\mathrm{kq}+\mathrm{c}$ in block ( $\mathrm{i}, \mathrm{k}$ ) of A is brought in, it is multiplied in turn by the locally stored $q$ elements of $B$, and the results added to the appropriate $q$ elements of $C$


Fig. 5.14. How Processor ( $\mathrm{i}, \mathrm{j}$ ) operates on an element of $A$ and one block-row of $B$ to update one block-row of C .

On the Cm* NUMA-type shared-memory multiprocessor, this block algorithm exhibited good, but sublinear, speedup $p=16$, speed-up $=5$ in multiplying $24 \times 24$ matrices; improved to $9(11)$ for larger $36 \times 36(48 \times 48)$ matrices The improved locality of block matrix multiplication can also improve the running time on a uniprocessor, or distributed shared-memory multiprocessor with caches

Reason: higher cache hit rates.

## 6 More Shared-Memory Algorithms

Chapter Goals

- Develop PRAM algorithms for more complex problems
(background on corresponding sequential algorithms also presented)
- Discuss some practical implementation issues such as data distribution


## Chapter Contents

- 6.1. Sequential Rank-Based Selection
- 6.2. A Parallel Selection Algorithm
- 6.3. A Selection-Based Sorting Algorithm
- 6.4. Alternative Sorting Algorithms
- 6.5. Convex Hull of a 2D Point Set
- 6.6. Some Implementation Aspects


### 6.1 Sequential Rank-Based Selection

Selection: Find a (the) kth smallest among $n$ elements
Naive solution through sorting, $\mathrm{O}(\mathrm{n} \log \mathrm{n})$ time Linear-time sequential algorithm can be developed


## Sequential rank-based selection algorithm select(S, k)

1. if $|S|<q \quad\{q$ is a small constant $\}$
then sort $S$ and return the kth smallest element of $S$ else divide $S$ into |S|/q subsequences of size q Sort each subsequence and find its median Let the |S|/q medians form the sequence T endif
2. $m=\operatorname{select}(T,|T| / 2)$
\{find the median $m$ of the |S|/q medians\}
3. Create 3 subsequences

L : Elements of $S$ that are $<\mathrm{m}$
$E$ : Elements of $S$ that are $=m$
G: Elements of $S$ that are $>m$
4. if $|L| \geq k$
then return $\operatorname{select}(\mathrm{L}, \mathrm{k})$
else if $|L|+|E| \geq k$
then return $m$ else return $\operatorname{select}(\mathrm{G}, \mathrm{k}-|\mathrm{L}|-|\mathrm{E}|)$
endif

## Analysis:

$T(n)=T(n / q)+T(3 n / 4)+c n$

$$
\begin{aligned}
& \text { Let } q=5 \text {; we guess the solution to be } T(n)=d n \\
& d n=d n / 5+3 d n / 4+c n \quad \Rightarrow \quad d=20 c
\end{aligned}
$$

## Examples for sequential selection

 from an input list of size $\mathrm{n}=25$ using $\mathrm{q}=5$

To find the 5th smallest element in S, select the 5th smallest element in L


Answer: 1
The 9th smallest element of $S$ is 3

The 13th smallest element of $S$ is found by selecting the 4th smallest element in G


Answer: 4

### 6.2 A Parallel Selection Algorithm

Parallel rank-based selection algorithm PRAMselect(S, $k$, p)

1. if $|S|<4$
then sort $S$ and return the kth smallest element of $S$
else broadcast $|S|$ to all $p$ processors
divide $S$ into $p$ subsequences $S(j)$ of size $|S| / p$
Processor j, $0 \leq \mathrm{j}<\mathrm{p}$, compute $\mathrm{T}_{\mathrm{j}}:=\operatorname{select}(\mathrm{S}(\mathrm{j}), \mid \mathrm{S}(\mathrm{j}) / 2)$
endif
2. $m=\operatorname{PRAM}$ select $(T,|T| / 2, p) \quad$ \{median of the medians $\}$
3. Broadcast $m$ to all processors and create 3 subsequences

L : Elements of $S$ that are $<\mathrm{m}$
E : Elements of $S$ that are $=m$
G: Elements of $S$ that are $>m$
4. if $|L| \geq k$
then return PRAMselect(L, $k, p)$
else if $\mid$ ㄴ $+|E| \geq k$
then return $m$
else return PRAMselect( $G, k-|L|-|E|, p)$
endif
Analysis: Let $\mathrm{p}=\mathrm{n}^{1-\mathrm{x}}$, with $\mathrm{x}>0$ a known constant
e.g., $x=1 / 2 \Rightarrow p=\sqrt{n}$
$T(n, p)=T\left(n^{1-x}, p\right)+T(3 n / 4, p)+c n^{x}=O\left(n^{x}\right)$
Speed-up $(n, p)=\Theta(n) / O\left(n^{x}\right)=\Omega\left(n^{1-x}\right)=\Omega(p)$ Efficiency $=\Omega(1)$
What if $x=0$, i.e., we use $p=n$ processors for an $n$-input selection problem?

### 6.3 A Selection-Based Sorting Algorithm



Fig. 6.1. Partitioning of the sorted list for selectionbased sorting.

## Parallel selection-based sort PRAMselectionsort(S, p)

1. if $|S|<k$ then return quicksort(S)
2. for $i=1$ to $k-1$ do
$m_{j}:=\operatorname{PRAMselect}(\mathrm{S}, \mathrm{i}|\mathrm{S}| / \mathrm{k}, \mathrm{p})$
$\left\{\right.$ for notational convenience, let $\left.\mathrm{m}_{0}:=-\infty ; \mathrm{m}_{\mathrm{k}}:=+\infty\right\}$ endfor
3. for $\mathrm{i}=0$ to $\mathrm{k}-1$ do
make the sublist $T^{(i)}$ from elements of $S$ in ( $m_{i}, m_{i+1}$ ) endfor
4. for $\mathrm{i}=1$ to $\mathrm{k} / 2$ do in parallel

PRAMselectionsort(T(i), 2p/k)
\{p/(k/2) processors used for each
of the k/2 subproblems $\}$
endfor
5. for $i=k / 2+1$ to $k$ do in parallel

PRAMselectionsort(T(i), 2p/k)
endfor
Analysis: Let $p=n^{1-x}$, with $x>0$ a known constant, $k=2^{1 / x}$
$\mathrm{T}(\mathrm{n}, \mathrm{p})=2 \mathrm{~T}(\mathrm{n} / \mathrm{k}, 2 \mathrm{p} / \mathrm{k})+\mathrm{c} \mathrm{n}^{\mathrm{x}}=\mathrm{O}\left(\mathrm{n}^{\mathrm{x}} \log \mathrm{n}\right)$
Why can't all k subproblems be handled in Step 4 at once?

Speedup $(\mathrm{n}, \mathrm{p})=\Omega(\mathrm{n} \log \mathrm{n}) / \mathrm{O}\left(\mathrm{n}^{\mathrm{x}} \log \mathrm{n}\right)=\Omega\left(\mathrm{n}^{1-\mathrm{x}}\right)=\Omega(\mathrm{p})$ Efficiency $=$ Speedup $/ p=\Omega(1)$
Work(n, p) $=\mathrm{pT}(\mathrm{n}, \mathrm{p})=\Theta\left(\mathrm{n}^{1-\mathrm{x}}\right) \mathrm{O}\left(\mathrm{n}^{\mathrm{x}} \log \mathrm{n}\right)=\mathrm{O}(\mathrm{n} \log \mathrm{n})$
Our asymptotic analysis is valid for $x>0$ but not for $x=0$; i.e., PRAMselectionsort does not allow us to sort $p$ keys in optimal $\mathrm{O}(\log p)$ time

## Example:

Threshold values:

$$
\begin{aligned}
& \mathrm{mO}_{0}=-\infty \\
& \mathrm{n} / \mathrm{k}=25 / 4 \approx 6 \quad \mathrm{~m}_{1}=\operatorname{PRAMselect}(\mathrm{S}, 6,5)=2 \\
& 2 \mathrm{n} / \mathrm{k}=50 / 4 \approx 13 \quad \mathrm{~m}_{2}=\operatorname{PRAMselect}(\mathrm{S}, 13,5)=4 \\
& 3 \mathrm{n} / \mathrm{k}=75 / 4 \approx 19 \quad \mathrm{~m} 3=\text { PRAMselect }(\mathrm{S}, 19,5)=6 \\
& m 4=+\infty \\
& \mathrm{T}: \quad-\quad-\quad-2|-\quad-\quad-\quad 4|-\quad-\quad-6 \mid-\ldots-\ldots- \\
& \mathrm{T}: \quad \begin{array}{llllll|lllllll|llllll|llllll} 
& 0 & 1 & 1 & 1 & 2 & 2 & 3 & 3 & 4 & 4 & 4 & 4 & 5 & 5 & 5 & 5 & 5 & 6 & 6 & 6 & 7 & 7 & 8 & 9
\end{array}
\end{aligned}
$$

### 6.4 Alternative Sorting Algorithms

Sorting via random sampling
Given a large list $S$ of inputs, a random sample of the elements can be used to establish k comparison thresholds In fact, it would be easier if we pick $k=p$, so that each of the resulting subproblems is handled by a single processor.
Assume $\mathrm{p} \ll \sqrt{\mathrm{n}}$ :
Parallel randomized sort PRAMrandomsort(S, p)

1. Processor j, $0 \leq j<p$, pick $|S| / p^{2}$ random samples of its $|S| / p$ elements and store them in its corresponding section of a list $T$ of length $|S| / p$
2. Processor 0 sort the list $T$
\{the comparison threshold $m_{i}$ is the $\left(i|S| / p^{2}\right)$ th element of $\left.T\right\}$
3. Processor $\mathrm{j}, 0 \leq \mathrm{j}<\mathrm{p}$, store its elements falling in $\left(m_{i}, m_{i+1}\right)$ into $T^{(i)}$
4. Processor $\mathrm{j}, 0 \leq \mathrm{j}<\mathrm{p}$, sort the sublist $\mathrm{T}^{(i)}$

## Parallel radixsort

In binary version of radixsort, we examine every bit of the k-bit keys in turn, starting from the least-significant bit (LSB)
In Step i , bit i is examined, $0 \leq \mathrm{i}<\mathrm{k}$
The records are stably sorted by the value of the ith key bit
Example (keys are followed by their binary representations in parentheses):

| Input list | Sort by LSB | Sort by middle bit | Sort by MSB |
| :---: | :---: | :---: | :---: |
| 5 (101) | 4 (100) | 4 (100) | 1 (001) |
| 7 (111) | 2 (010) | 5 (101) | 2 (010) |
| 3 (011) | $\underline{2}$ (010) | 1 (001) | 2 (010) |
| 1 (001) | 5 (101) | 2 (010) | 3 (011) |
| 4 (100) | 7 (111) | 2 (010) | 4 (100) |
| 2 (010) | 3 (011) | 7 (111) | 5 (101) |
| 7 (111) | 1 (001) | 3 (011) | 7 (111) |
| 2 (010) | 7 (111) | 7 (111) | 7 (111) |

## Performing the required data movements

| Input list | Compl. of Bit 0 | Diminished prefix sums | Bit 0 | Prefix sums plus 2 | Shifted list |
| :---: | :---: | :---: | :---: | :---: | :---: |
| 5 (101) | 0 | - | 1 | $1+2=3$ | 4 (100) |
| 7 (111) | 0 | - | 1 | $2+2=4$ | 2 (010) |
| 3 (011) | 0 | - | 1 | $3+2=5$ | $\underline{2}$ (010) |
| 1 (001) | 0 | - | 1 | $4+2=6$ | 5 (101) |
| 4 (100) | 1 | 0 | 0 | - | 7 (111) |
| 2 (010) | 1 | 1 | 0 | - 7 | 3 (011) |
| 7 (111) | 0 | - | 1 | $5+2=7$ | 1 (001) |
| 2 (010) | 1 | 2 | 0 | - | 7 (111) |

The running time consists mainly of the time to perform $2 k$ parallel prefix computations: $\mathrm{O}(\log p)$ for $k$ constant

### 6.5 Convex Hull of a 2D Point Set



Fig. 6.2. Defining the convex hull problem.

## Best sequential algorithm for $p$ points: $\Omega(p \log p)$ steps



Fig. 6.3. Illustrating the properties of the convex hull.

## Parallel convex hull algorithm PRAMconvexhull(S, p)

1. Sort the point set by the $x$ coordinates
2. Divide the sorted list into $\sqrt{p}$ subsets $Q(i)$ of size $\sqrt{p}, 0 \leq i<\sqrt{p}$
3. Find the convex hull of each subset $Q^{(i)}$ using $\sqrt{\mathrm{p}}$ processors
4. Merge the $\sqrt{\mathrm{p}}$ convex hulls $\mathrm{CH}\left(\mathrm{Q}^{(\mathrm{i})}\right)$ into the overall hull $\mathrm{CH}(\mathrm{Q})$


Fig. 6.4. Multiway divide and conquer for the convex hull problem.


Fig. 6.5. Finding points in a partial hull that belong to the combined hull.

## Analysis:

$$
T(p, p)=T\left(p^{1 / 2}, p^{1 / 2}\right)+c \log p \approx 2 c \log p
$$

## The intiail sorting time is also $\mathrm{O}(\log \mathrm{p})$

### 6.6 Some Implementation Aspects



Fig. 6.6. Matrix storage in column-major order to allow concurrent accesses to rows.


Fig. 6.7. Skewed matrix storage for conflict-free accesses to rows and columns.

$A_{i j}$ is viewed as vector element $i+j m$

Fig. 6.8. A $6 \times 6$ matrix viewed, in column-major order, as a 36 -element vector.

The vector in Fig. 6.8 may be accessed in some or all of the following ways
Column: $\quad k, k+1, k+2, k+3, k+4, k+5 \quad$ Stride $=1$
Row: $\quad k, k+m, k+2 m, k+3 m, k+4 m, k+5 m \quad$ Stride $=m$
Diagonal: $\quad k, k+m+1, k+2(m+1), k+3(m+1)$,

$$
k+4(m+1), k+5(m+1) \quad \text { Stride }=m+1
$$

Antidiagonal: $k, k+m-1, k+2(m-1), k+3(m-1)$,

$$
k+4(m-1), k+5(m-1)
$$

Stride $=m-1$
Linear skewing scheme:
stores the kth vector element in the bank a + kb mod B
The address within the bank is irrelevant to conflict-free parallel access
In fact, the constant a above is also irrelevant and can be safely ignored
So we can limit our attention to linear skewing schemes that assign $\mathrm{V}_{\mathrm{k}}$ to memory module $\mathrm{M}_{\mathrm{kb} \bmod } \mathrm{B}$
With a linear skewing scheme, the vector elements $k$, $k+s$, $\mathrm{k}+2 \mathrm{~s}, \cdots, \mathrm{k}+(\mathrm{B}-1) \mathrm{s}$ will be assigned to different memory modules iff sb is relatively prime with respect to the number $B$ of memory banks.

## To allow access from each processor to every memory bank, we need a permutation network

Even with a full permutation network (complex, expensive), full PRAM functionality is not realized
Practical processor-to-memory network cannot realize all permutations (they are blocking)


Fig. 6.9. Example of a multistage memory access network.

## 7 Sorting and Selection Networks

Chapter Goals

- Become familiar with the circuit-level models of parallel processing
- Introduce useful design tools and trade-off issues via a familiar problem
(three more application-specific examples to come in Chapter 8)


## Chapter Contents

- 7.1. What Is a Sorting Network?
- 7.2. Figures of Merit for Sorting Networks
- 7.3. Design of Sorting Networks
- 7.4. Batcher Sorting Networks
- 7.5. Other Classes of Sorting Networks
- 7.6. Selection Networks


### 7.1 What Is a Sorting Network?



Fig. 7.1. An n -input sorting network or an n -sorter.


Fig. 7.2. Block diagram and four different schematic representations for a 2-sorter.


Fig. 7.3. Parallel and bit-serial hardware realizations of a 2-sorter.


Fig. 7.4. Block diagram and schematic representation of a 4-sorter.

How to verify that the circuit of Fig. 7.4 is a valid 4 -sorter?
The answer is easy in this case
After the first two circuit levels, the top line carries the smallest and the bottom line the largest of the four values
The final 2 -sorter orders the middle two values
More generally, we need to verify the correctness of an nsorter through formal proofs or by time-consuming exhaustive testing. Neither approach is attractive.
The zero-one principle: A comparison-based sorter is valid iff it correctly sorts all $0 / 1$ sequences.


### 7.2 Figures of Merit for Sorting Networks

a. Cost: number of 2-sorter blocks used in the design b. Delay: number of 2 -sorters on the critical path

$\mathrm{n}=12,39$ modules, 9 levels

$\mathrm{n}=16,60$ modules, 10 levels

Fig. 7.5. Some low-cost sorting networks.





$\mathrm{n}=12,40$ modules, 8 levels

Fig. 7.6. Some fast sorting networks.

### 7.3 Design of Sorting Networks



Fig. 7.7. Brick-wall 6-sorter based on odd-even transposition.
$C(n)=C(n-1)+n-1=(n-1)+(n-2)+\cdots+2+1=n(n-1) / 2$
$D(n)=D(n-1)+2=2+2+\cdots+2+1=2(n-2)+1=2 n-3$
Cost $\times$ Delay $=n(n-1)(2 n-3) / 2=\Theta\left(n^{3}\right)$


Parallel insertion sort $=$ Parallel selection sort $=$ Parallel bubble sort $!$


Fig. 7.8. Sorting network based on insertion sort or selection sort.

### 7.4 Batcher Sorting Networks



Fig. 7.9. Batcher's even-odd merging network for $\mathbf{4 + 7}$ inputs.
$\mathrm{x}_{0} \leq \mathrm{x}_{1} \leq \cdots \leq \mathrm{x}_{\mathrm{m}-1} \quad$ (k Os) $\quad \mathrm{y}_{0} \leq \mathrm{y}_{1} \leq \cdots \leq \mathrm{ym}_{\mathrm{m}}-1 \quad$ (k' 0 s$)$
Merge $x_{0}, x_{2}, \cdots$ and $y_{0}, y_{2}, \cdots$ to get $v_{0}, v_{1}, \cdots \quad$ keven $=\lceil k / 2\rceil+\left\lceil\mathrm{k}^{\prime} / 2\right\rceil 0 s$ Merge $x_{1}, x_{3}, \cdots$ and $y_{1}, y_{3}, \cdots$ to get $w_{0}, w_{1}, \cdots \quad k_{\text {odd }}=\lfloor k / 2\rfloor+\left\lfloor k^{\prime} / 2\right\rfloor 0 s$ Compare-exchange the pairs of elements $\quad \mathrm{w}_{0}: \mathrm{v}_{1}, \mathrm{w}_{1}: \mathrm{v}_{2}, \mathrm{w}_{2}: \mathrm{v}_{3}, \cdots$
Case $a$ : $k_{\text {even }}=k_{\text {odd }} \quad$ The sequence $v_{0} w_{0} v_{1} w_{1} v_{2} w_{2} \cdots$ is already sorted Case $b$ : $k_{\text {even }}=k_{\text {odd }}+1$ The sequence $v_{0} w_{0} v_{1} w_{1} v_{2} w_{2} \cdots$ is already sorted Case c: Keven $=k_{\text {odd }}+2$

$$
\begin{array}{cccccccccccccc}
\mathrm{v} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\
\mathrm{w} & & 0 & 0 & 0 & 0 & 0 & 0 & \underline{1} & 1 & 1 & 1 & 1
\end{array}
$$

Batcher's ( $\mathrm{m}, \mathrm{m}$ ) even-odd merger, when m is a power of 2 , is characterized by the following recurrences:
$C(m)=2 C(m / 2)+m-1=(m-1)+2(m / 2-1)+4(m / 4-1)+\cdots=m \log _{2} m+1$
$D(m)=D(m / 2)+1=\log _{2} m+1$
Cost $\times$ Delay $=\Theta\left(m \log ^{2} m\right)$


Fig. 7.10. The recursive structure of Batcher's evenodd merge sorting network.


Fig. 7.11. Batcher's even-odd merge sorting network for eight inputs.

## Batcher sorting networks based on the even-odd merge technique are characterized by the following recurrences:

$C(n)=2 C(n / 2)+(n / 2)\left(\log _{2}(n / 2)\right)+1 \approx n\left(\log _{2} n\right)^{2 / 2}$
$D(n)=D(n / 2)+\log _{2}(n / 2)+1=D(n / 2)+\log _{2} n=\log _{2} n\left(\log _{2} n+1\right) / 2$
Cost $\times$ Delay $=\Theta\left(n \log ^{4} n\right)$

## Bitonic sorters

Bitonic sequence: "rises then falls", "falls then rises", or is obtained from the first two categories through cyclic shifts or rotations. Examples include:

133466622100 Rises, then falls
877666546889 Falls, then rises
898776665468 The previous sequence, right-rotated by 2


Fig. 7.12. The recursive structure of Batcher's bitonic sorting network.


Fig. 14.2. Sorting a bitonic sequence on a linear array.


Fig. 7.13. Batcher's bitonic sorting network for eight inputs.

### 7.5 Other Classes of Sorting Networks

Periodic balanced sorting networks


Fig. 7.14. Periodic balanced sorting network for eight inputs.

## Desirable properties:

a. Regular and modular (easier VLSI layout).
b. Slower, but more economical, implementations are possible by reusing the blocks
c. Using an extra block provides tolerance to some faults (missed exchanges)
d. Using 2 extra blocks provides tolerance to any single fault (a missed or incorrect exchange)
e. Multiple passes through a faulty network can lead to correct sorting (graceful degradation)
f. Single-block design can be made fault-tolerant by adding an extra stage to the block

## Shearsort-based sorting networks

Offer some of the same advantages enumerated for periodic balanced sorting networks


Fig. 7.15. Design of an 8 -sorter based on shearsort on $2 \times 4$ mesh.


Fig. 7.16. Design of an $\mathbf{8}$-sorter based on shearsort on $4 \times 2$ mesh.

### 7.6 Selection Networks

Any sorting network can be used as a selection network, but a selection network (yielding the kth smallest or largest input value) is in general simpler and faster
One can define three selection problems:
I. Select the k smallest values; present in sorted order
II. Select kth smallest value
III. Select the k smallest values; present in any order Circuit and time complexity: (I) hardest, (III) easiest


Fig. 7.17. A type III (8, 4)-selector.
Classifier: a selection network that can divide a set of $n$ values into $n / 2$ largest and $n / 2$ smallest values
The selection network of Fig. 7.17 is in fact an 8 -input classifier

# Generalizing the construction of Fig. 7.17, an n-input classifier can be built from two (n/2)-sorters followed by n/2 comparators 



Figure for Problem 7.7.


Figure for Problem 7.9.


Figure for Problem 7.11.

## 8 Other Circuit-Level Examples

Chapter Goals

- Study three application areas: dictionary operations, parallel prefix, DFT
- Develop circuit-level parallel architectures for solving these problems:
Tree machine
Parallel prefix networks
FFT circuits

Chapter Contents

- 8.1. Searching and Dictionary Operations
- 8.2. A Tree-Structured Dictionary Machine
- 8.3. Parallel Prefix Computation
- 8.4. Parallel Prefix Networks
- 8.5. The Discrete Fourier Transform
- 8.6. Parallel Architectures for FFT


### 8.1 Searching and Dictionary Operations

Parallel ( $p+1$ )-ary search:

$$
\log _{p+1}(n+1)=\log _{2}(n+1) / \log _{2}(p+1) \text { steps }
$$



This algorithm is optimal: no comparison-based search algorithm can be faster

Speed-up $\approx \log _{2}(p+1)$
A single search in a sorted list cannot be significantly speeded up by parallel processing, but all hope is not lost

Dynamic data sets
Batch searching

Basic dictionary operations: record keys $\mathrm{x}_{0}, \mathrm{x}_{1}, \cdots, \mathrm{x}_{\mathrm{n}-1}$
search(y) Find record with key y and return its data insert( $\mathrm{y}, \mathrm{z}$ ) Augment list with a record: $\mathrm{key}=\mathrm{y}$, data = z delete(y) Remove record with key $y$, return data

Some or all of the following ops might also be of interest:
findmin Find record with smallest key; return data
findmax Find record with largest key; return data
findmed Find record with median key; return data
findbest(y) Find record with key "nearest" to y
findnext(y) Find record whose key would appear immediately after y if ordered
findprev(y) Find record whose key would appear immediately before y if ordered
extractmin Remove record(s) with smallest key; return data?
extractmax Remove record(s) with largest key; return data?
extractmed Remove the record(s) with median key value; return data?

The operations "findmin" and "extractmin" (or "findmax" and "extractmax") are referred to as priority queue operations

### 8.2 A Tree-Structured Dictionary Machine



Fig. 8.1. A tree-structured dictionary machine.
The combining function of the triangular nodes is as follows:
search(y) Pass OR of "yes" signals, with data from "yes" side, or from either side if both indicate "yes"
findmin Pass the smaller of two key values, with data (findmax is similar; findmed not supported)
findbest(y) Pass the larger of two match-degree indicators, with the corresponding record
findnext(y) Leaf nodes generate a "larger" flag bit; findmin is performed among all larger values (findprev is similar)


Fig. 8.2. Tree machine storing five records and containing three free slots.


Fig. 8.3. Systolic data structure for minimum, maximum, and median finding.


Extractmin
Extractmed
Extractmax

### 8.3 Parallel Prefix Computation



Fig. 8.4. Prefix computation using a latched or pipelined function unit.

## Example: Prefix sums

| $x_{0}$ | $x_{1}$ | $x_{2}$ | $\cdots$ | $x_{i}$ |
| :--- | :--- | :--- | :--- | :--- |
| $x_{0}$ | $x_{0}+x_{1}$ | $x_{0}+x_{1}+x_{2}$ | $\cdots$ | $x_{0}+x_{1}+\ldots+x_{i}$ |
| $s_{0}$ | $s_{1}$ | $s_{2}$ | $\cdots$ | $s_{i}$ |



Fig. 8.5. High-throughput prefix computation using a pipelined function unit.

### 8.4 Parallel Prefix Networks



Fig. 8.6. Prefix sum network built of one $n / 2$-input networks and $\mathrm{n}-1$ adders.

$$
\begin{aligned}
& T(n)=T(n / 2)+2=2 \log _{2} n-1 \\
& C(n)=C(n / 2)+n-1=2 n-2-\log _{2} n
\end{aligned}
$$



Fig. 8.7. Prefix sum network built of two $\mathrm{n} / 2$-input networks and $\mathrm{n} / 2$ adders.

$$
\begin{aligned}
& T(n)=T(n / 2)+1=\log _{2} n \\
& C(n)=2 C(n / 2)+n / 2=(n / 2) \log _{2} n
\end{aligned}
$$



Fig. 8.8. Brent-Kung parallel prefix graph for $\mathbf{n}=16$.


Fig. 8.9. Kogge-Stone parallel prefix graph for $\mathbf{n}=$ 16.


Fig. 8.10. A hybrid Brent-Kung/Kogge-Stone parallel prefix graph for 16 inputs.

Type-x parallel prefix network
Produces the leftmost output in $\log _{2}$ (\# of inputs) time
Yields all other outputs with at most x additional delay Building the fastest possible parallel prefix network (type-0)


### 8.5 The Discrete Fourier Transform

$$
y_{i}=\sum_{j=0}^{n-1} \omega_{n}{ }^{i j} x_{j}
$$

The DFT is expressed in matrix form as $y=F_{n} x$

$$
\left[\begin{array}{c}
y_{0} \\
y_{1} \\
\vdots \\
\vdots \\
y_{n-1}
\end{array}\right]=\left[\begin{array}{ccccc}
1 & 1 & 1 & \ldots & 1 \\
1 & \omega_{n} & \omega_{n}{ }^{2} & \ldots & \omega_{n} n-1 \\
\vdots & \vdots & \vdots & \ldots & \vdots \\
1 & \omega_{n}{ }^{n-1} & \omega_{n}{ }^{2(n-1)} & \ldots & \omega_{n}(n-1)^{2}
\end{array}\right]\left[\begin{array}{c}
x_{0} \\
x_{1} \\
\vdots \\
\vdots \\
x_{n-1}
\end{array}\right]
$$

$\omega_{n}$ :nth primitive root of unity; $\omega_{n}{ }^{n}=1 \& \omega_{n}{ }^{j} \neq 1$ for $1 \leq j<n$
Examples: $\omega_{4}=\mathrm{i}=\sqrt{-1}, \omega_{3}=-1 / 2+\mathrm{i} \sqrt{3} / 2$
Inverse DFT, for recovering $x$, given $y$, is essentially the same computation as DFT:

$$
x_{i}=\frac{1}{n} \sum_{j=0}^{n-1} \omega_{n}^{-i j} y_{j}
$$

Any matrix-vector multiplication algorithm can be used to compute DFTs
However, the special structure of $F_{n}$ can be exploited to devise a much faster divide-and-conquer algorithm
The resulting algorithm is known as the fast Fourier transform (FFT)

## DFT Applications

## Spectral analysis



Tone frequency assignments for touch-tone dialing

Frequency spectrum of received tone

## Signal smoothing or filtering

Input signal with noise


Recovered smooth signal


## Fast Fourier Transform (FFT)

Partition the DFT sum into odd- and even-indexed terms

$$
\begin{aligned}
y_{i} & =\sum_{j=0}^{n-1} \omega_{n}{ }^{i j} x_{j}=\sum_{\text {even }(2 r)} \omega_{n}{ }^{i j} x_{j}+\sum_{j \text { odd }(2 r+1)} \omega_{n}{ }^{i j} x_{j} \\
& =\sum_{r=0}^{n / 2-1} \omega_{n / 2}{ }^{i r} x_{2 r}+\omega_{n} \sum_{r=0}^{n / 2-1} \omega_{n / 2}{ }^{i r} x_{2 r+1}
\end{aligned}
$$

The identity $\omega_{\mathrm{n} / 2}=\omega_{\mathrm{n}}{ }^{2}$ has been used in the derivation
The two terms in the last expression are $\mathrm{n} / 2$-point DFTs

$$
u=F_{n / 2}\left[\begin{array}{c}
x_{0} \\
x_{2} \\
\vdots \\
\vdots \\
x_{n-2}
\end{array}\right] \quad v=F_{n / 2}\left[\begin{array}{c}
x_{1} \\
x_{3} \\
\vdots \\
\vdots \\
x_{n-1}
\end{array}\right]
$$

Then:

$$
y_{i}=\begin{array}{ll}
u_{i}+\omega_{n}{ }^{i} v_{i} & 0 \leq i<n / 2 \\
u_{i-n / 2}+\omega_{n}^{i} v_{i-n / 2} & n / 2 \leq i<n \quad\left(\text { or } y_{i+n / 2}=u_{i}+\omega_{n}^{i+n / 2} v_{i}\right)
\end{array}
$$

Hence:n-point FFT = two n/2-point FFTs + n multiply-adds Sequential complexity of FFT:T(n) $=2 T(n / 2)+n=n \log _{2} n$ Unit of time = latency of one multiply-add operation If the two $\mathrm{n} / 2$-point subproblems are solved in parallel and the $n$ multiply-add operations are also concurrent, with their inputs supplied instantly, the parallel time complexity is:

$$
T(n)=T(n / 2)+1=\log _{2} n
$$

### 8.6 Parallel Architectures for FFT



Fig. 8.11. Butterfly network for an 8 -point FFT.


Fig. 8.12. FFT network variant and its sharedhardware realization.


Fig. 8.13. Linear array of $\log _{2} n$ cells for $n$-point FFT computation.

## Part III Mesh-Based Architectures

## Part Goals

- Study 2D mesh and torus networks in depth
- of great practical significance
- used in recent parallel machines
- regular with short wires (scalable)
- Discuss 3D and higher-dimensional meshes/tori as well as variants and derivative architectures

Part Contents<br>- Chapter 9: Sorting on a 2D Mesh or Torus<br>- Chapter 10: Routing on a 2D Mesh or Torus<br>- Chapter 11: Numerical 2D Mesh Algorithms<br>- Chapter 12: Mesh-Related Architectures

## 9 Sorting on a 2D Mesh or Torus

Chapter Goals

- Introduce the mesh model (processors, links, communication)
- Develop 2D mesh sorting algorithms
- Learn about mesh capabilities/weaknesses in communication-intensive problems


## Chapter Contents

- 9.1. Mesh-Connected Computers
- 9.2. The Shearsort Algorithm
- 9.3. Variants of Simple Shearsort
- 9.4. Recursive Sorting Algorithms
- 9.5. A Nontrivial Lower Bound
- 9.6. Achieving the Lower Bound


### 9.1 Mesh-Connected Computers



Fig. 9.1. Two-dimensional mesh-connected computer.

We focus on 2D mesh (>2D discussed in Chapter 12)
NEWS or four-neighbor mesh (others in Chapter 12)
Square $(\sqrt{\mathrm{p}} \times \sqrt{\mathrm{p}})$ or rectangular $(\mathrm{r} \times \mathrm{p} / \mathrm{r})$ mesh
MIMD, SPMD, or SIMD mesh
All-port versus single-port communication
Weak SIMD model: all communications in same direction
Diameter-based and bisection-based lower bounds


Fig. 9.2. A $5 \times 5$ torus folded along its columns. Folding this diagram along the rows will produce a layout with only short links.

Row-Major


Shuffled Row-Major

| 0 | 1 | 4 | 5 |
| ---: | ---: | ---: | ---: |
| 2 | 3 | 6 | 7 |
| 8 | 9 | 12 | 13 |
| 10 | 11 | 14 | 15 |

Snake-like Row-Major


Proximity Order


Fig. 9.3. Some linear indexing schemes for the processors in a 2D mesh.

## Interprocessor communication



Fig. 9.4. Reading data from NEWS neighbors via virtual local registers.

### 9.2 The Shearsort Algorithm

Shearsort algorihm for a 2D mesh with r rows
repeat $\left\lceil\log _{2} r\right\rceil$ times

endrepeat
Sort the rows


Snakelike

Fig. 9.5. Description of the shearsort algorithm on an r-row 2D mesh.

$$
T_{\text {shearsort }}=\left\lceil\log _{2} r\right\rceil(p / r+r)+p / r
$$

On a square $\sqrt{\mathrm{p}} \times \sqrt{\mathrm{p}}$ mesh, $\mathrm{T}_{\text {shearsort }}=\sqrt{\mathrm{p}}\left(\log _{2} \mathrm{p}+1\right)$

## Proof of correctness of shearsort via the 0-1 principle

 Assume that in doing the column sorts, we first sort pairs of elements in the column and then sort the entire columnRow 2 i
Row $2 \mathrm{i}+1$


Bubbles up in the next column sort

Case (a):
More 0s


Case (b):
More 1s


Sinks down in the next column sort

Fig. 9.6. A pair of dirty rows create at least one clean row in each shearsort iteration.


Fig. 9.7. The number of dirty rows halves with each shearsort iteration.


Fig. 9.8. Example of shearsort on a $4 \times 4$ mesh.

### 9.3 Variants of Simple Shearsort

Sorting 0s and 1s on a linear array: odd-even transposition steps can be limited to the number of dirty elements
Example: sorting 000001011111 requires at most two steps Thus, we can replace complete column sorts of shearsort with successively fewer odd-even transposition steps

$$
\begin{aligned}
& \mathrm{T}_{\text {opt shearsort }}=(\mathrm{p} / \mathrm{r})\left(\log _{2} r+1\right)+r+r / 2+\cdots+2 \\
& \quad=(\mathrm{p} / \mathrm{r})\left(\log _{2} r+1\right)+2 r-2\left[r=\sqrt{\mathrm{p}}: \sqrt{\mathrm{p}}\left(\frac{1}{2} \log _{2} \mathrm{p}+3\right)-2\right]
\end{aligned}
$$

Keys | 1 | 12 | 21 | 4 |
| :---: | :---: | :---: | :---: |
| 6 | 26 | 25 | 10 |
| 15 | 20 | 13 | 2 |
| 31 | 32 | 16 | 30 |
| 5 | 9 | 18 | 7 |
| 11 | 19 | 27 | 8 |
| 22 | 3 | 14 | 17 |
| 28 | 23 | 29 | 24 |



Two keys held by one processor

$$
\text { Row } \begin{array}{ccccc}
\hline 1 & 3 & 6 & 11 \\
2 & 4 & 8 & 12 \\
\text { sort } \\
15 & 13 & 9 & 5 \\
16 & 14 & 10 & 7 \\
17 & 19 & 23 & 28 \\
18 & 20 & 25 & 29 \\
31 & 27 & 24 & 21 \\
32 & 30 & 26 & 22 \\
\hline
\end{array}
$$

Column | 1 | 6 | 11 | 2 |
| :---: | :---: | :---: | :---: |
| 4 | 8 | 12 | 3 |
| 5 | 9 | 15 | 13 |
| 7 | 10 | 16 | 14 |
| 28 | 20 | 17 | 19 |
| 29 | 23 | 18 | 25 |
| 31 | 24 | 21 | 26 |
| 32 | 30 | 22 | 27 |



The final row sort (snake-like or row-major) is not shown.
Fig. 9.9. Example of shearsort on a $4 \times 4$ mesh with two keys stored per processor.

### 9.4 Recursive Sorting Algorithms



1. Sort quadrants

2. Sort columns

3. Sort rows

4. Apply $4 \sqrt{\mathrm{p}}$ steps of odd-even transposition along the snake

Fig. 9.10. Graphical depiction of the first recursive algorithm for sorting on a 2 D mesh based on four-way divide and conquer.

$$
T(\sqrt{p})=T(\sqrt{p} / 2)+5.5 \sqrt{p} \approx 11 \sqrt{p}
$$



Fig. 9.11. The proof of the first recursive sorting algorithm for 2D meshes.

$$
x \geq b+c+\lfloor(a-b) / 2\rfloor+\lfloor(d-c) / 2\rfloor
$$

A similar inequality for $x$ leads to:

$$
\begin{aligned}
x+x^{\prime} \geq & b+c+\lfloor(a-b) / 2\rfloor+\lfloor(d-c) / 2\rfloor \\
& \quad+a^{\prime}+d^{\prime}+\left\lfloor\left(b^{\prime}-a^{\prime}\right) / 2\right\rfloor+\left\lfloor\left(c^{\prime}-d^{\prime}\right) / 2\right\rfloor \\
\geq & b+c+a^{\prime}+d^{\prime}+(a-b) / 2+(d-c) / 2 \\
& \quad+\left(b^{\prime}-a^{\prime}\right) / 2+\left(c^{\prime}-d^{\prime}\right) / 2-4 \times 1 / 2 \\
= & \left(a+a^{\prime}\right) / 2+\left(b+b^{\prime}\right) / 2+\left(c+c^{\prime}\right) / 2+\left(d+d^{\prime}\right) / 2-2 \\
\geq & \sqrt{p}-4
\end{aligned}
$$

The number of dirty rows after Phase 3: $\sqrt{p}-x-x^{\prime} \leq 4$
Thus, at most $4 \sqrt{p}$ of the $p$ elements are out of order along the overall snake

## Another recursive sorting algorithm



1. Sort quadrants

2. Sort double columns in snake-like order

3. Shuffle row elements

4. Apply $2 \sqrt{\mathrm{p}}$ steps of odd-even transposition along the overall snake

Fig. 9.12. Graphical depiction of the second recursive algorithm for sorting on a 2D mesh based on four-way divide and conquer.

$$
T(\sqrt{p})=T(\sqrt{p} / 2)+4.5 \sqrt{p} \approx 9 \sqrt{p}
$$



Fig. 9.13. The proof of the second recursive sorting algorithm for 2D meshes.

### 9.5 A Nontrivial Lower Bound

We now have a $9 \sqrt{\mathrm{p}}$-time mesh sorting algorithm
Two questions of interest:

1. Can we raise the $2 \sqrt{\mathrm{p}}-2$ diameter-based lower bound?

Yes, for snakelike sort, the bound $3 \sqrt{p}-o(\sqrt{p})$
can be derived
2. Can we design an algorithm with better time than $9 \sqrt{\mathrm{p}}$ ?

Yes, the Schnorr-Shamir sorting algorithm
requires $3 \sqrt{p}+o(\sqrt{p})$ steps


Fig. 9.14. The proof of the $3 \sqrt{p}-o(\sqrt{p})$ lower bound for sorting in snakelike row-major order.

| 0 | 0 | 0 | 0 | 0 | 1 | 2 | 3 | 4 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 0 | 0 | 0 | 5 | 6 | 7 | 8 | 9 |
| 0 | 0 | 0 | 10 | 11 | 12 | 13 | 14 | 15 |
| 0 | 0 | 0 | 16 | 17 | 18 | 19 | 20 | 21 |
| 0 | 0 | 22 | 23 | 24 | 25 | 26 | 27 | 28 |
| 0 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 |
| 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 |
| 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 |
| 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 |


| 64 | 64 | 64 | 64 | 64 | 1 | 2 | 3 | 4 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 64 | 64 | 64 | 64 | 5 | 6 | 7 | 8 | 9 |
| 64 | 64 | 64 | 10 | 11 | 12 | 13 | 14 | 15 |
| 64 | 64 | 64 | 16 | 17 | 18 | 19 | 20 | 21 |
| 64 | 64 | 22 | 23 | 24 | 25 | 26 | 27 | 28 |
| 64 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 |
| 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 |
| 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 |
| 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 |



Fig. 9.15. Illustrating the effect of fewer or more 0 s in the shaded area.

### 9.6 Achieving the Lower Bound



Fig. 9.16. Notation for the asymptotically optimal sorting algorithm.

Schnorr-Shamir algorithm for snakelike sorting on a 2D mesh

1. Sort all blocks in snakelike order, independently \& in parallel
2. Permute the columns such that the columns of each vertical slice are evenly distributed among all vertical slices
3. Sort each block in snakelike order
4. Sort the columns independently from top to bottom
5. Sort Blocks 0\&1, 2\&3, ... of all vertical slices together in snakelike order; i.e., sort within $2 p^{3 / 8} \times p^{3 / 8}$ submeshes
6. Sort Blocks $1 \& 2,3 \& 4, \cdots$ of all vertical slices together in snake-like order; again done within $2 p^{3 / 8} \times p^{3 / 8}$ submeshes
7. Sort the rows independently in snakelike order
8. Apply $2 p^{3 / 8}$ steps of odd-even transposition to the snake

## 10 Routing on a 2-D Mesh or Torus

Chapter Goals

- Learn how to route multiple data items to their respective destinations
(in PRAM routing is nonexistent and in the circuit model it is hardwired)
- Become familiar with issues in packet routing and wormhole routing


## Chapter Contents

- 10.1. Types of Data Routing Operations
- 10.2. Useful Elementary Operations
- 10.3. Data Routing on a 2D Array
- 10.4. Greedy Routing Algorithms
- 10.5. Other Classes of Routing Algorithms
- 10.6. Wormhole Routing


### 10.1 Types of Data Routing Operations

One-to-one communication (point-to-point messages)

| a | b |  |  |
| :--- | :--- | :--- | :--- |
|  |  | c |  |
|  | d | e | f |
| g |  | h |  |

Packet sources



Packet destinations


Routing paths

Collective communication (per the MPI standard)
a. One to many: broadcast, multicast, scatter
b. Many to one: combine, fan-in, gather
c. Many to many: many-to-many m-cast, all-to-all b-cast, scatter-gather (gossiping), total exchange

## Some special data routing operations

a. Data compaction or packing

| a | b |  |  |
| :--- | :--- | :--- | :--- |
|  |  | c |  |
|  | d | e | f |
| g |  | h |  |


| a | b | c |  |
| :---: | :---: | :---: | :---: |
| d | e | f |  |
| g | h |  |  |
|  |  |  |  |

Fig. 10.1. Example of data compaction or packing.
b. Random-access write (RAW): Emulating one memory write step of a PRAM with p processors
c. Random-access read (RAR): Emulating one memory read step of a PRAM with p processors

### 10.2 Useful Elementary Operations

## Row or column rotation

Sorting records by a key field

## Semigroup computation



Horizontal combining $\approx \sqrt{\mathrm{p}} / 2$ steps


Vertical combining
$\approx \sqrt{\mathrm{p}} / 2$ steps

Fig. 10.2. Recursive semigroup computation in a 2D mesh.

## Parallel prefix computation



Quadrant Prefixes


Vertical Combining


Horizontal Combining (includes reversal)

Fig. 10.3. Recursive parallel prefix computation in a 2D mesh.

## Routing within a row or column



Fig. 10.4. Example of routing multiple packets on a linear array.

### 10.3 Data Routing on a 2D Array

Exclusive random-access write on a 2D mesh: MeshRAW

1. Sort packets in column-major order by destination column number; break ties by destination row number
2. Shift packets to the right, so that each item is in the correct column. There will be no conflict since at most one element in each row is headed for a given column
3. Route the packets within each column


|  | 0 | 1 | 2 | 3 |
| :---: | :---: | :---: | :---: | :---: |
| 0 | 0,0 |  | 0,2 |  |
| 1 | 1,0 | 1,1 | 1,2 | 1,3 |
| 2 |  |  | 2,2 | 2,3 |
| 3 | 3,0 | $3,1$ | 3,2 |  |

Fig. 10.5. Example of random-access write on a 2D mesh.

Not a shortest-path routing algorithm
e.g., packet headed to $(3,1)$ first goes to $(0,1)$

But fairly efficient

$$
\begin{array}{rlrl}
T & =3 p^{1 / 2}+o\left(p^{1 / 2}\right) & & \text { \{snakelike sorting \}} \\
& +p^{1 / 2} & & \text { column reversal\} } \\
& +2 p^{1 / 2}-2 & & \text { row \& column routing \}} \\
& =6 p^{1 / 2}+o\left(p^{1 / 2}\right) &
\end{array}
$$

Or $11 p^{1 / 2}+o\left(p^{1 / 2}\right)$ with unidirectional communication

### 10.4 Greedy Routing Algorithms

Greedy: pick a move that causes the most progress toward the destination in each step
Example greedy algorithm: dimension-order (e-cube)


Fig. 10.6. Example of greedy row-first routing on a 2D mesh.
$T=2 p^{1 / 2}-2$ \{but requires large buffers\}


Fig. 10.7. Demonstrating the worst-case buffer requirement with row-first routing.

## Routing algorithms thus far

Slow $6 p^{1 / 2}$, but with no conflict (no additional buffer) Fast $2 p^{1 / 2}$, but with large node buffers

An algorithm that allows trading off time for buffer space


Fig. 10.8. Illustrating the structure of the intermediate routing algorithm.

$$
\begin{array}{rlr}
T & =4 p^{1 / 2} / q+o\left(p^{1 / 2} / q\right) & \text { \{column-major block sort }\} \\
& +2 p^{1 / 2}-2 \quad\{\text { route }\} \\
& =(2+4 / q) p^{1 / 2}+o\left(p^{1 / 2} / q\right)
\end{array}
$$

Buffer space per node

$$
\begin{aligned}
& r_{k}=\text { number of packets in } B_{k} \text { headed for column } j \\
& \sum_{k=0}^{q-1}\left[\frac{r_{k}}{p^{1 / 2 / q}}\right]<\sum_{k=0}^{q-1}\left(1+\frac{r_{k}}{p^{1 / 2 / q}}\right) \leq q+\left(q / p^{1 / 2}\right) \sum_{k=0}^{q-1} r_{k} \leq 2 q
\end{aligned}
$$

### 10.5 Other Classes of Routing Algorithms

Row-first greedy routing has very good average-case performance, even if the node buffer size is restricted Idea: Convert any routing problem to two random instances by picking a random intermediate node for each message Using combining for concurrent writes:


Fig. 10.9. Combining of write requests headed for the same destination.

Terminology for routing problems or algorithms
Static: packets to be routed all available at $\mathrm{t}=0$
Dynamic: packets "born" in course of computation
Off-line: routes precomputed, stored in tables
On-line: routing decisions made on the fly
Oblivious: path depends only on source \& destination
Adaptive: path may vary by link and node conditions
Deflection: any received packet leaves immediately, even if this means misrouting (via detour path); also known as hot-potato routing

### 10.6 Wormhole Routing



Fig. 10.10. The notions of worms and deadlock in wormhole routing.

Any routing algorithm can be used to choose the path taken by the worm, but practical choices limited by the need for a quick decision
Example: row-first routing, with 2-byte header for row \& column displacements


Fig. 10.11. Various ways of dealing with conflicts in wormhole routing.

The deadlock problem in wormhole routing


Deadlock!
Two strategies for dealing with deadlocks:
(1) Avoidance
(2) Detection and recovery

Checking for deadlock potential via link dependence graph; existence of cycles may lead to deadlock


3-by-3 mesh with its links numbered


Unrestricted routing (following shortest path)


E-cube routing (row-first)

Fig. 10.12. Use of dependence graph to check for the possibility of deadlock.

## Using virtual channels

## Several virtual channels time-share one physical channel <br> Virtual channels serviced in round-robin fashion



Fig. 10.13. Use of virtual channels for avoiding deadlocks.


Figure for Problem 10.14.

## 11 Numerical 2D Mesh Algorithms

Chapter Goals

- Deal with a sample of numerical and seminumerical algorithms for meshes
- Introduce additional techniques for the design of mesh algorithms

Chapter Contents

- 11.1. Matrix Multiplication
- 11.2. Triangular System of Equations
- 11.3. Tridiagonal System of Equations
- 11.4. Arbitrary System of Linear Equations
- 11.5. Graph Algorithms
- 11.6. Image-Processing Algorithms


### 11.1 Matrix Multiplication

Matrix-vector multiplication $\quad y_{i}=\sum_{j=0}^{m-1} a_{i j} x_{j}$


Fig. 11.1. Matrix-vector multiplication on a linear array.


With $p=m$ processors, $T=2 m-1=2 p-1$

Matrix-matrix multiplication

$$
\mathrm{C}_{\mathrm{ij}}=\sum_{\mathrm{k}=0}^{\mathrm{m}-1} \mathrm{a}_{\mathrm{ik}} \mathrm{~b}_{\mathrm{kj}}
$$



Fig. 11.2. Matrix-matrix multiplication on a 2D mesh.

With $p=m$ processors, $T=3 m-2=3 \sqrt{p}-2$


Fig. 11.3. Matrix-vector multiplication on a ring.
With $\mathrm{p}=\mathrm{m}$ processors, $\mathrm{T}=\mathrm{m}=\mathrm{p}$


Fig. 11.4. Matrix-matrix multiplication on a torus.

With $p=m^{2}$ processors, $T=m=\sqrt{p}$
For $m>\sqrt{p}$, use block matrix multiplication communication can be overlapped with computation

### 11.2 Triangular System of Equations



Lower triangular (if $\mathrm{a}_{\mathrm{ii}}=0$, then it is strictly lower triangular)


Upper triangular
(if $\mathrm{a}_{\mathrm{ii}}=0$, then it is strictly upper triangular)

Fig. 11.5. Lower/upper triangular square matrix.


## Forward substitution (lower triangular) Back substitution (upper triangular)



Fig. 11.6. Solving a triangular system of linear equations on a linear array.


Fig. 11.7. Inverting a triangular matrix by solving triangular systems of linear equations.


Fig. 11.8. Inverting a lower triangular matrix on a 2D mesh.

### 11.3 Tridiagonal System of Linear Equations



Fig. 11.9. A tridiagonal system of linear equations.


Tridiagonal, pentadiagonal, matrices arise in the solution of differential equations using finite difference methods

Odd-even reduction: the ith equation can be rewritten as:

$$
x_{i}=\left(1 / d_{i}\right)\left(b_{i}-l_{i} x_{i-1}-u_{i} x_{i+1}\right)
$$

Take the $x_{i}$ equations for odd $i$ and plug into even-indexed equations (the ones with even subscripts for $I, d, u, b$ )

We get for each even $\mathrm{i}(0 \leq \mathrm{i}<\mathrm{m})$ an equation of the form:
$-\frac{l_{i-1} l_{i}}{d_{i-1}} x_{i-2}+\left(d_{i}-\frac{i_{i} u_{i-1}}{d_{i-1}}-\frac{u_{i} i_{i+1}}{d_{i+1}}\right) x_{i}-\frac{u_{i} u_{i+1}}{d_{i+1}} x_{i+2}=b_{i}-\frac{l_{i} i_{i-1}}{d_{i-1}}-\frac{u_{i} b_{i+1}}{d_{i+1}}$
Each equation formed needs 6 multiplies, 6 divides, 4 adds


Fig. 11.10. The structure of odd-even reduction for solving a tridiagonal system of equations.

Assuming unit-time arithmetic operations and $p=m$

$$
\mathrm{T}(\mathrm{~m})=\mathrm{T}(\mathrm{~m} / 2)+8 \approx 8 \log _{2} \mathrm{~m}
$$

The 6 divides can be replaced with 1 reciprocation per equation, to find $1 / \mathrm{d}_{\mathrm{j}}$ for each odd j , plus 6 multiplies
We have ignored interprocessor communication time. The analysis is thus valid only for PRAM or for an architecture whose topology matches the structure of Fig. 11.10.


Fig. 11.11. Binary $X$-tree (with dotted links) and multigrid architectures.

Odd-even reduction on a linear array of $p=m$ processors


Communication time $=2(1+2+4+\ldots+m / 2)=2 m-2$
Sequential complexity of odd-even reduction is also $\mathrm{O}(\mathrm{m})$
On an m-processor 2D mesh, odd-even reduction can be easily organized to require $\Theta(\sqrt{\mathrm{m}})$ time

### 11.4 Arbitrary System of Linear Equations

Gaussian elimination

$$
\begin{array}{rl}
2 x_{0}+4 x_{1}-7 x_{2}=3 & 2 x_{0}+4 x_{1}-7 x_{2}=7 \\
3 x_{0}+6 x_{1}-10 x_{2}=4 & 3 x_{0}+6 x_{1}-10 x_{2}=8 \\
-x_{0}+3 x_{1}-4 x_{2}=6 & -x_{0}+3 x_{1}-4 x_{2}=-1
\end{array}
$$

The extended A' matrix for these $\mathrm{k}=2$ sets of equations in $m=3$ unknowns has $m+k=5$ columns:

$$
A^{\prime}=\left[\begin{array}{ccccc}
2 & 4 & -7 & 3 & 7 \\
3 & 6 & -10 & 4 & 8 \\
-1 & 3 & -4 & 6 & -1
\end{array}\right]
$$

Divide row 0 by 2; add -3 times row 0 to row 1 and add 1 times row 0 to row 2 :

$$
\begin{aligned}
& A^{\prime}(0)=\left[\begin{array}{ccccc}
1 & 2 & -7 / 2 & 3 / 2 & 7 / 2 \\
0 & 0 & 1 / 2 & -1 / 2 & -5 / 2 \\
0 & 5 & -15 / 2 & 15 / 2 & 5 / 2
\end{array}\right] \\
& A^{\prime \prime}(0)=\left[\begin{array}{ccccc}
1 & 2 & -7 / 2 & 3 / 2 & 7 / 2 \\
0 & 5 & -15 / 2 & 15 / 2 & 5 / 2 \\
0 & 0 & 1 / 2 & -1 / 2 & -5 / 2
\end{array}\right] \\
& A^{\prime}(1)=\left[\begin{array}{ccccc}
1 & 0 & -1 / 2 & -3 / 2 & 5 / 2 \\
0 & 1 & -3 / 2 & 3 / 2 & 1 / 2 \\
0 & 0 & 1 / 2 & -1 / 2 & -5 / 2
\end{array}\right] \\
& A^{\prime}(2)=\left[\begin{array}{ccccc}
1 & 0 & 0 & -2 & 0 \\
0 & 1 & 0 & 0 & -7 \\
0 & 0 & 1 & -1 & -5
\end{array}\right]
\end{aligned}
$$

Solutions are read out from the last column of $\mathrm{A}^{\prime}(2)$

## Gaussian elimination on a 2D array



Fig. 11.12. A linear array performing the first phase of Gaussian elimination.


Fig. 11.13. Implementation of Gaussian elimination on a 2D array.


Fig. 11.14. Matrix inversion by Gaussian elimination.

## Jacobi relaxation

Assuming $\mathrm{a}_{\mathrm{ij}} \neq 0$, solve the ith equation for $\mathrm{x}_{\mathrm{i}}$, yielding m equations from which new (better) approximations to the answers can be obtained.

$$
x_{i}^{(t+1)}=\left(1 / a_{i j}\right)\left[b_{i}-\sum_{j \neq i} a_{i i} x_{j}^{(t)}\right] ; x_{i}^{(0)}=\text { initial approx for } x_{i}
$$

On an m-processor linear array, each iteration takes $O(m)$ steps. The number of iterations needed is $\mathrm{O}(\log \mathrm{m})$ in most cases, leading to $O(m \log m)$ average time.
A variant: Jacobi overrelaxation

$$
\mathrm{x}_{\mathrm{i}}^{(\mathrm{t}+1)}=(1-\gamma) \mathrm{x}_{\mathrm{i}}^{(\mathrm{t})}+\left(\gamma / \mathrm{a}_{\mathrm{ij}}\right)\left[\mathrm{b}_{\mathrm{i}}-\sum_{\mathrm{j} \neq \mathrm{i}} \mathrm{a}_{\mathrm{ii}} \mathrm{x}_{\mathrm{j}}^{(\mathrm{t})}\right] \quad 0<\gamma \leq 1
$$

For $\gamma=1$, the method is the same as Jacobi relaxation For smaller $\gamma$, overrelaxation may offer better performance
11.5 Graph Algorithms

$A=\left[\begin{array}{llllll}0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0\end{array}\right]$

$$
W=\left[\begin{array}{ccccc}
0 & 2 & 2 & \infty & 2 \\
1 & 0 & 2 & \infty & \infty \\
\infty & \infty & 0 & -3 & \infty \\
\infty & \infty & \infty & 0 & 0 \\
1 & \infty & \infty & \infty & 0
\end{array}\right]
$$

Fig. 11.15. Matrix representation of directed graphs.

The transitive closure of a graph
Graph with same node set but with an edge between two nodes if there is any path between them in original graph

$$
\begin{array}{ll}
A^{0}=I & \text { Paths of length } 0 \\
A^{1}=A & \text { Paths of length } 1
\end{array}
$$

Compute higher "powers" of A using matrix multiplication, except that AND/OR replace multiplication/addition
$A^{2}=A \times A \quad$ Paths of length 2
$A^{3}=A^{2} \times A \quad$ Paths of length $3 \quad$ etc.
The transitive closure has the adjacency matrix $A^{*}$

$$
A^{*}=A^{0}+A^{1}+A^{2}+\cdots \quad\left(A_{i j}^{*}=1 \text { iff } j\right. \text { is reachable from i) }
$$

To compute $A^{*}$, we need only proceed up to the term $A^{n-1}$; if there is a path from $i$ to $j$, there must be one of length $<n$
Rather than base the derivation of $A^{*}$ on computing the various powers of the Boolean matrix A, we can use the following simpler algorithm:

Phase $0 \quad$ Insert the edge ( $\mathrm{i}, \mathrm{j}$ ) into the graph if ( $\mathrm{i}, 0$ ) and $(0, j)$ are in the graph
Phase 1 Insert the edge ( $\mathrm{i}, \mathrm{j}$ ) into the graph if ( $\mathrm{i}, 1$ ) and $(1, j)$ are in the graph

Phase $k \quad$ Insert the edge ( $\mathrm{i}, \mathrm{j}$ ) into the graph if ( $\mathrm{i}, \mathrm{k}$ ) and $(k, j)$ are in the graph
Graph $A^{(k)}$ then has an edge $(i, j)$ iff there is a path from $i$ to $j$ that goes only through nodes $\{1,2, \cdots, k\}$ as intermediate hops

Phase $n-1 \quad$ The graph $A^{(n-1)}$ is the required answer $A^{*}$
A key question is how to proceed so that each phase takes $O(1)$ time for an overall $O(n)$ time on an $n \times n$ mesh
The $O(n)$ running time would be optimal in view of the $O\left(n^{3}\right)$ sequential complexity of the transitive closure problem


Fig. 11.16. Transitive closure algorithm on a 2D mesh.

## Systolic retiming



Original delays


Adjusted delays

Example of retiming by delaying the inputs to $C_{L}$ and advancing the outputs from $C_{L}$ by d units [Fig. 12.8 in Computer Arithmetic: Algorithms and Hardware Designs, by Parhami, Oxford, 2000]


Fig. 11.17. Systolic retiming to eliminate broadcasting.

The diagram on the left represents the preceding algorithm Zero-time horizontal arrows represent brodcasting by diagonal elements
The goal of systolization is to eliminate all zero-time transitions

To systolize the preceding example:
Add $2 n-2=6$ units of delay to edges crossing cut 1
Move 6 units of delay from inputs to outputs of node $(0,0)$

### 11.6 Image-Processing Algorithms

Labeling the connected components of a binary image


Fig. 11.18. Connected components in an $8 \times 8$ binary image.

Recursive algorithm, $\mathrm{p}=\mathrm{n}: \mathrm{T}(\mathrm{n})=\mathrm{T}(\mathrm{n} / 4)+\mathrm{O}(\sqrt{\mathrm{n}})=\mathrm{O}(\sqrt{\mathrm{n}})$


Fig. 11.19. Finding the connected components via divide and conquer.

## Lavialdi's algorithm

| 01 | 1 | 1 |
| :--- | :--- | :--- |
| 10 | 10 | 0 |
| if $N=W=1$ |  |  |

Fig. 11.20. Transformation or rewriting rules for Lavialdi's algorithm in the shrinkage phase (no other pixel changes).


Fig. 11.21. Example of the shrinkage phase of Lavialdi's algorithm.

$$
T(n)=2 \sqrt{n}-1\{\text { shrinkage }\}+2 \sqrt{n}-1 \text { \{expansion }\}
$$

Component do not merge in the shrinkage phase Consider a 0 that is about to become a 1

$$
\begin{array}{cccc}
x & 1 & y & \text { If any } y \text { is } 1 \text {, then already connected } \\
1 & 0 & y & \text { If } z \text { is } 1 \text { then it will change to } 0 \text { unless } \\
y & y & z & \text { at least one neighboring } y \text { is } 1
\end{array}
$$

## 12 Mesh-Related Architectures

Chapter Goals

- Study vairants of simple mesh architectures that offer higher performance or greater cost-effectiveness
- Learn about related architectures such as pyramids and mesh of trees


## Chapter Contents

- 12.1. Three or More Dimensions
- 12.2. Stronger and Weaker Connectivities
- 12.3. Meshes Augmented with Nonlocal Links
- 12.4. Meshes with Dynamic Links
- 12.5. Pyramid and Multigrid Systems
- 12.6. Meshes of Trees


### 12.1 Three or More Dimensions

3D mesh: $\quad D=3 p^{1 / 3}-3$ instead of $2 p^{1 / 2}-2$

$$
B=p^{2 / 3} \text { rather than } p^{1 / 2}
$$

Example: $\quad 8 \times 8 \times 8$ mesh
$D=21, B=64$ $22 \times 23$ mesh $\quad D=43, B=23$


Fig. 12.1. 3D and 2.5D physical realizations of a 3D mesh.

4D, 5D, ... meshes: optical links?
$q D$ mesh with $m$ processors along each dimension: $p=m q$
Node degree $\quad d=2 q$
Diameter
$D=q(m-1)=q\left(p^{1 / q}-1\right)$
Bisection width: $\quad B=p^{1-1 / q}$ when $m=p^{1 / q}$ is even
$q D$ torus with $m$ processors along each dimension
$=\mathrm{m}$-ary q-cube

## Sorting on a 3D mesh

A generalized form of shearsort is available However, the following algorithm (due to Kunde) is both faster and simpler. Let Processor (i, j, k) in an $m \times m \times m$ mesh be in Row i , Column j , and Layer k


Sorting on 3D mesh (zyx order; reverse of node index)
Phase 1: Sort elements on each zx plane into zx order Phase 2: Sort elements on each yz plane into zy order Phase 3: Sort elements on each xy layer into yx order (odd layers in reverse order).
Phase 4: Apply two steps of odd-even transposition along the $z$ direction
Phase 5: Sort elements on each xy layer into yx order
Time $=4 \times 2 \mathrm{D}$-sort time +2 steps

## Data routing on a 3D mesh

Greedy zyx (layer-first, row last) routing algorithm
Phase 1: Sort into zyx order by destination addresses
Phase 2: Route along z dimension to correct xy layer
Phase 3: Route along y dimension to correct column
Phase 4: Route along x dimension to destination node

## Matrix multiplication on a 3D mesh

Divide matrices into $\mathrm{m}^{1 / 4} \times \mathrm{m}^{1 / 4}$ arrays of $\mathrm{m}^{3 / 4} \times \mathrm{m}^{3 / 4}$ blocks


A total of $\left(m^{1 / 4}\right)^{3}=m^{3 / 4}$ block multiplications are needed Assume the use of an $\mathrm{m}^{3 / 4} \times \mathrm{m}^{3 / 4} \times \mathrm{m}^{3 / 4}$ mesh with $\mathrm{p}=\mathrm{m}^{9 / 4}$ Each $\mathrm{m}^{3 / 4} \times \mathrm{m}^{3 / 4}$ layer of the mesh is assigned to one of the $\mathrm{m}^{3 / 4} \times \mathrm{m}^{3 / 4}$ matrix multiplications ( $\mathrm{m}^{3 / 4}$ multiply-add steps)
The rest of the process takes time that is of lower order
The algorithm matches both the sequential work and the diameter-based lower bound

# Modeling of physical systems <br> Natural mapping of a 3D physical model to a 3D mesh 

## Low- vs. high-dimensional meshes

A low-dimensional mesh can simulate a high-dimensional mesh quite efficiently

It is thus natural to ask the following question:
Is it more cost effective, e.g., to have 4-port processors in a 2D mesh architecture or 6-port processors in a 3D mesh, given that for the 4-port processors, fewer ports and ease of layout allows us to make each channel wider?

### 12.2 Stronger and Weaker Connectivities

Fortified meshes


Node i connected to $i \pm 1$,
$\mathrm{i} \pm 7$, and $\mathrm{i} \pm 8(\bmod 19)$.
Fig. 12.2. Eight-neighbor and hexagonal (hex) meshes.

Oriented meshes (can be viewed as a type of pruning)


Fig. 12.3. A $4 \times 4$ Manhattan street network.

## Pruned meshes

Same diameter as ordinary mesh, but much lower cost.


Fig. 12.4. A pruned $4 \times 4 \times 4$ torus with nodes of degree four [Kwai97].

## Pruning and orientation can be combined

## Another form of pruning




Fig. 12.5. Eight-neighbor mesh with shared links and example data paths.

### 12.3 Meshes Augmented with Nonlocal Links

Motivation: to reduce the diameter; a weakness of meshes Bypass links or express channels along rows/columns


Fig. 12.6. Three examples of bypass links along the rows of a 2 D mesh.

Using a single global bus


Fig. 12.7. Mesh with a global bus and semigroup computation on it.

A $\sqrt{p} \times \sqrt{p}$ mesh with a single global bus can perform a semigroup computation $O\left(p^{1 / 3}\right)$ rather than $O\left(p^{1 / 2}\right)$ steps Assume that the semigroup operation $\otimes$ is commutative

## Semigroup computation on 2D mesh with a global bus

Phase 1: Find the partial results in $p^{1 / 3} \times p^{1 / 3}$ submeshes in $O\left(p^{1 / 3}\right)$ steps; results stored in the upper left corner of each submesh
Phase 2: Combine the partial results in $O\left(p^{1 / 3}\right)$ steps, using a sequential algorithm in one node and the global bus for data transfers
Phase 3: Broadcast the result to all nodes in one step

## Row and column buses



2D-mesh semigroup computation, row/column buses
Phase 1: Find the partial results in $p^{1 / 6} \times p^{1 / 6}$ submeshes in $O\left(p^{1 / 6}\right)$ steps
Phase 2: Distribute the $p^{1 / 3}$ values left on some of the rows among the $p^{1 / 6}$ rows in the same slice
Phase 3: Combine row values in $p^{1 / 6}$ steps (row bus)
Phase 4: Distribute column-0 values to $p^{1 / 3}$ columns
Phase 5: Combine column values in $\mathrm{p}^{1 / 6}$ steps
Phase 6: Use column buses to distribute the $\mathrm{p}^{1 / 3}$ values on row 0 among the $\mathrm{p}^{1 / 6}$ rows of row slice 0 in constant time
Phase 7: Combine row values in $\mathrm{p}^{1 / 6}$ steps Phase 8: Broadcast the result to all nodes (2 steps)

### 12.4 Meshes with Dynamic Links

Linear array with a separable bus


Fig. 12.9. Linear array with a separable bus using reconfiguration switches.

## Semigroup computation: $\mathrm{O}(\log p)$ steps

2D mesh with separable row/column buses

Reconfigurable mesh architecture


Fig. 12.10. Some processor states in a reconfigurable mesh.

### 12.5 Pyramid and Multigrid Systems



Fig. 12.11. Pyramid with 3 levels and $4 \times 4$ base along with its 2D layout.

Originally developed for image processing applications Roughly $3 / 4$ of the processors belong to the base
For an I-level pyramid: $D=2 \mid-2 \quad d=9 \quad B=2 l$
Semigroup computation faster than on mesh, but not sorting or arbitrary routing


Fig. 12.12. The relationship between pyramid and 2D multigrid architectures.

### 12.6 Meshes of Trees



Fig. 12.13. Mesh of trees architecture with 3 levels and a $4 \times 4$ base.


2-D layout for mesh of trees network with a 4-by-4 base (root nodes are in the middle row and column)


Fig. 12.14. Alternate views of the mesh of trees architecture with a $4 \times 4$ base.

Semigroup computation: done via row/column combining Parallel prefix computation: similar

Routing $\mathrm{m}^{2}$ packets, one per processor on the $\mathrm{m} \times \mathrm{m}$ base: row-first routing yields an $\Omega(\mathrm{m})=\Omega(\sqrt{\mathrm{p}})$ scheme
In the view of Fig. 12.14, with only m packets to be routed from one side of the network to the other, $2 \log _{2} m$ steps are required, provided that the destination nodes are all distinct

Sorting $\mathrm{m}^{2}$ keys, one per processor on the $\mathrm{m} \times \mathrm{m}$ base: emulate shearshort

In the view of Fig. 12.14, with only $m$ keys to be sorted, the following algorithm can be used (assume that row/column root nodes have been merged and each holds one key)

Sorting $m$ keys on a mesh of trees with an $m \times m$ base
Phase 1: Broadcast keys to leaves within both trees (leaf $\mathrm{i}, \mathrm{j}$ gets $\mathrm{x}_{\mathrm{i}}$ and $\mathrm{x}_{\mathrm{j}}$ )
Phase 2: At a base node:
if $\mathrm{x}_{\mathrm{j}}>\mathrm{x}_{\mathrm{i}}$ or $\mathrm{x}_{\mathrm{j}}=\mathrm{x}_{\mathrm{i}}$ and $\mathrm{j}>\mathrm{i}$ then flag := 1 else flag :=0
Phase 3: Add the "flag" values in column trees
(root $i$ obtains the rank of $x_{i}$ )
Phase 4: Route $x_{i}$ from root $i$ to root rank[i]

Matrix-vector multiplication $A x=y$ : matrix $A$ is stored on the base and vector $x$ in the column roots, say; the result vector $y$ is obtained in the row roots

Multiplying $m \times m$ matrix by $m$-vector on mesh of trees
Phase 1: Broadcast $x_{j}$ in the ith column tree
(leaf $\mathrm{i}, \mathrm{j}$ has $\mathrm{a}_{\mathrm{ij}}$ and $\mathrm{x}_{\mathrm{i}}$ )
Phase 2: At each base processor compute $a_{i j} x_{j}$
Phase 3: Sum over row trees
(row root i obtains $\sum_{i=0}^{m-1} a_{i j} x_{j}=y_{i}$ )
With pipelining, $r$ matrix-vector pairs multiplied in $21-2+r$ steps

## Convolution of two vectors

Assume the mesh of trees with an $m \times(2 m-1)$ base contains m diagonal trees in addition to the row and column trees, as shown in Fig. 12.15

Convolution of two m-vectors on a mesh of trees with an $m \times(2 m-1)$ base
Phase 1: Broadcast $x_{j}$ from the ith row root to all row nodes on the base

Phase 2: Broadcast $y_{m-1-1}$ from the diagonal root to the base diagonal
Phase 3: Leaf $\mathrm{i}, \mathrm{j}$, which has $\mathrm{x}_{\mathrm{i}}$ and $\mathrm{y}_{2 \mathrm{~m}-2-\mathrm{i}-\mathrm{j}}$, multiplies them to get $x_{i} y_{2 m-2 i-i}$
Phase 4: Sum columns to get $z_{2 m-2-j}=\sum_{i=0}^{m-1} x_{i} y_{2 m-2-i-j}$ in column root $j$

Phases 1 and 2 can be overlapped


Fig. 12.15. Mesh of trees variant with row, column, and diagonal trees.

Minimal-weight spanning tree for an undirected graph
A spanning tree of a connected graph is a subset of its edges that preserves the connectivity of all nodes in the graph but does not contain any cycle
A minimal-weight spanning tree (MWST) is a subset of edges that has the minimum total weight among all spanning trees
This is an important problem: if the graph represents a communication (transportation) network, MWSP tree might correspond to the best way to broadcast a message to all nodes (deliver products to the branches of a chain store from a central warehouse)

## Greedy sequential MWST algorithm

Assume all edge weights are unique so that there is always a single minimum-weight edge among any subset
At each step, we have a set of connected components or "supernodes" (initially n single-node components)

We connect each component to its "nearest" neighbor; i.e., we find the minimum-weight edge that connects the component to another one


Fig. 12.16. Example for the minimal-weight spanning tree algorithm.

If the graph's weight matrix W is stored in the leaves of a mesh of trees architecture, each phase requires $O\left(\log ^{2} n\right)$ steps with a simple algorithm (to be shown) and $\mathrm{O}(\log n)$ steps with a more sophisticated algorithm.
The total running time is thus $\mathrm{O}\left(\log ^{3} n\right)$ or $\mathrm{O}\left(\log ^{2} n\right)$.
Sequential algorihms and their time complexities:
Kruskal's: $\mathrm{O}(\mathrm{e} \log \mathrm{e}) \Rightarrow \mathrm{O}\left(\mathrm{n}^{2} \log \mathrm{n}\right)$ for dense graphs
Prim's (binary heap): $\mathrm{O}((e+n) \log n) \Rightarrow O\left(n^{2} \log n\right)$
Prim's (Fibonacci heap): $\mathrm{O}(\mathrm{e}+\mathrm{n} \log \mathrm{n}) \Rightarrow \mathrm{O}\left(\mathrm{n}^{2}\right)$
Thus, our best parallel solution offers a speedup of $O\left(n^{2} / \log ^{2} n\right)$; sublinear in the number $p=O\left(n^{2}\right)$ of processors
The key part of the simple parallel version of the greedy algorithm is showing that each phase can be done in O( $\log ^{2} n$ ) steps.
The algorithm for each phase consists of two subphases:
a. Find the min-weight edge incident to each supernode
b. Merge the supernodes for the next phase


Fig. 12.17. Finding the new supernode ID when several supernodes merge.

## Part IV Low-Diameter Architectures

## Part Goals

- Study the hypercube as an example of architectures with
- low (logarithmic) diameter
- wide bisection
- rich theoretical properties
- Present hypercube derivatives/alternatives that mitigate its realizability and scalability problems
- Complete our view of the "sea of interconnection networks"

Part Contents

- Chapter 13: Hypercubes and Their Algorithms
- Chapter 14: Sorting and Routing on Hypercubes
- Chapter 15: Other Hypercubic Architectures
- Chapter 16: A Sampler of Other Networks


## 13 Hypercubes and Their Algorithms

Chapter Goals

- Introduce the hypercube and its topological and algorithmic properties
- Design simple hypercube algorithms (sorting and routing to follow in Chapter 14)
- Learn about embeddings and their role in algorithm design and evaluation

Chapter Contents

- 13.1. Definition and Main Properties
- 13.2. Embeddings and Their Usefulness
- 13.3. Embedding of Arrays and Trees
- 13.4. A Few Simple Algorithms
- 13.5. Matrix Multiplication
- 13.6. Inverting a Lower Triangular Matrix


### 13.1 Definition and Main Properties

Brief history of the hypercube (binary q-cube) architecture
Concept developed: early 1960s [Squi63]
Direct (single-stage) \& indirect or multistage versions proposed for parallel processing: mid 1970s (early proposals [Peas77], [Sull77], no hardware)
Caltech's 64-node Cosmic Cube: early 1980s [Seit85] elegant solution to routing (wormhole routing) Several commercial machines: mid to late 1980s Intel PSC, CM-2, nCUBE (Section 22.3)

## Terminology

Hypercube generic term

3-cube, 4-cube, . . . , q-cube
when the number of dimensions is of interest

## A qD binary hypercube (q-cube) is defined recursively: <br> 1-cube: 2 connected nodes, labeled 0 and 1 $q$-cube consists of two ( $q-1$ )-cubes; $0 \& 1$ subcubes q-cube nodes labeled by preceding subcube node labels with 0 and 1 and connecting node $0 x$ to node $1 x$

Binary 1-cube built of two binary 0 -cubes labeled 0 and 1


Binary 2-cube built of two binary 1-cubes labeled 0 and 1
10


Three representations of a binary 3-cube




Two representations of a binary 4-cube


Fig. 13.1. The recursive hypercubes.

Number of nodes in a q-cube: $p=2^{q}$
Bisection width:
$B=p / 2=2^{q-1}$
Diameter:
$D=q=\log _{2} p$
Node degree:

$$
d=q=\log _{2} p
$$

The $q$ neighbors of node $x$ with binary ID $x_{q-1} x_{q-2} \cdots x_{2} x_{1} x_{0}$ :
$\mathrm{x}_{\mathrm{q}-1} \mathrm{x}_{\mathrm{q}-2} \cdots \mathrm{x}_{2} \mathrm{x}_{1} \overline{\mathrm{x}}_{0} \quad$ dimension-0 neighbor; $\mathrm{N}_{0}(\mathrm{x})$
$\mathrm{x}_{\mathrm{q}-1} \mathrm{x}_{\mathrm{q}-2} \cdots \mathrm{x}_{2} \overline{\mathrm{x}}_{1} \mathrm{x}_{0} \quad$ dimension-1 neighbor; $\mathrm{N}_{1}(\mathrm{x})$
$\bar{x}_{q-1} x_{q-2} \cdots x_{2} x_{1} x_{0} \quad$ dimension- $(q-1)$ neighbor; $N_{q-1}(x)$


Two nodes whose labels differ in k bits (have a Hamming distance of $k$ ) are connected by a shortest path of length $k$
Logarithmic diameter and linear bisection width are key reasons for the hypercube's high performance
Hypercube is both node- and edge-symmetric
Logarithmic node degree hinders hypercube's scalability

### 13.2 Embeddings and Their Usefulness



Fig. 13.2. Embedding a seven-node binary tree into 2D meshes of various sizes.
Examples of Fig. 13.2 $\rightarrow \quad 3 \times 3 \quad 2 \times 4 \quad 2 \times 2$$\begin{array}{lllll}\text { Dilation } & \text { Longest path onto which any edge is mapped } & 1 & 2 & 1\end{array}$(indicator of communication slowdown)
Congestion Max number of edges mapped onto one edge $\begin{array}{llll} & 2 & 2\end{array}$(indicator of contention during emulation)
Load factor Max number of nodes mapped onto one node $\begin{array}{llll} & 1 & 2\end{array}$(indicator of processing slowdown)
Expansion Ratio of number of nodes in the two graphs ..... 9/7 $\quad 8 / 7$ ..... 4/7 (indicator of emulation cost)

### 13.3 Embedding of Arrays and Trees



Fig. 13.3. Hamiltonian cycle in the $q$-cube.

## Proof of Hamiltonicity using Gray code:

(q-1)-bit codes
Assumed Gray code Assumed Gray code in reverse
q-bit Gray code


The $2^{m_{0}} \times 2^{m_{1}} \times \cdots \times 2^{m_{h-1}}$ mesh/torus is a subgraph of $q$-cube where $\mathrm{q}=\mathrm{m}_{0}+\mathrm{m}_{1}+\cdots+\mathrm{m}_{\mathrm{h}-1}$
This is akin to the mesh/torus being embedded in $q$-cube with dilation 1, congestion 1, load factor 1, and expansion 1 The proof is based on the notion of cross-product graphs, which we first define

Given $k$ graphs $G_{i}=\left(V_{i}, E_{i}\right), 1 \leq i \leq k$, their (cross-)product graph $G=G_{1} \times G_{2} \times \cdots \times G_{k}=(V, E)$ has:
node set $V=\left\{\left(v_{1}, v_{2}, \cdots, v_{k}\right) \mid v_{i} \in V_{i}, 1 \leq i \leq k\right\}$
edge set $E=\left\{\left[\left(u_{1}, u_{2}, \cdots, u_{k}\right),\left(v_{1}, v_{2}, \cdots, v_{k}\right)\right] \mid\right.$ for some $j,\left(u_{j}, v_{j}\right) \in E_{j}$ and for $\left.i \neq j, u_{i}=v_{i}\right\}$


Fig. 13.4. Examples of product graphs.
a. The $2^{m_{0}} \times 2^{m_{1}} \times \cdots \times 2^{m_{h-1}}$ torus is the product of $h$ rings of sizes $2^{m_{0}}, 2^{m_{1}}, \cdots, 2^{m_{n-1}}$
b. The $\left(m_{0}+m_{1}+\cdots+m_{h-1}\right)$-cube is the product of an $m_{0}$-cube, an $m_{1}$-cube, $\cdots$, an $m_{h-1}$-cube
c. The $2^{m_{i}}$-node ring is a subgraph of the $m_{i}$-cube
d. If component graphs are subgraphs of other component graphs, then the product graph will be a subgraph of the other product graph


Fig. 13.5. The $4 \times 4$ mesh/torus is a subgraph of the 4-cube.

Embedding ( $\left.2^{q}-1\right)$-node complete binary tree in q-cube Achieving dilation 1 is impossible


## Embedding the $2^{9}$-node double-rooted complete binary tree in q-cube


$2^{\mathrm{q}}$-node double-rooted complete binary tree


Double-rooted tree in the ( $\mathrm{q}-1$ )-cube 0

Double-rooted tree in the (q-1)-cube 1

Fig. 13.6. The 2 q -node double-rooted complete binary tree is a subgraph of the $q$-cube.


Fig. 13.7. Embedding a 15-node complete binary tree into the 3-cube.

### 13.4 A Few Simple Algorithms

Semigroup computation on the $q$-cube
Processor $\mathrm{x}, 0 \leq \mathrm{x}<\mathrm{p}$ do $\mathrm{t}[\mathrm{x}]:=\mathrm{v}[\mathrm{x}]$ \{initialize "total" to own value\} for $k=0$ to $q-1$ Processor $x, 0 \leq x<p$, do
get $y:=t\left[N_{k}(x)\right]$
set $\mathrm{t}[\mathrm{x}]:=\mathrm{t}[\mathrm{x}] \otimes \mathrm{y}$
endfor


Fig. 13.8. Semigroup computation on a 3-cube.

## Parallel prefix computation on the q-cube

Processor $x, 0 \leq x<p, \operatorname{dot}[x]:=u[x]:=v[x]$
\{initialize subcube "total" and partial prefix to own value\}
for $k=0$ to $q-1$ Processor $x, 0 \leq x<p$, do
get $y:=t\left[N_{k}(x)\right]$
set $\mathrm{t}[\mathrm{x}]:=\mathrm{t}[\mathrm{x}] \otimes \mathrm{y}$
if $x>N_{k}(x)$ then $u[x]:=u[x] \otimes y$
endfor


Fig. 13.9. Parallel prefix computation on a 3-cube.

Parallel prefixes in even and odd subcubes; own value excluded in the odd subcube computation



Exchange values and combine



Odd processors combine their own values


Fig. 13.10. A second algorithm for parallel prefix computation on a 3-cube.

Reversing a sequence on the $q$-cube
for $\mathrm{k}=0$ to $\mathrm{q}-1$ Processor $\mathrm{x}, 0 \leq \mathrm{x}<\mathrm{p}$, do
get $y:=v\left[N_{k}(x)\right]$
set $\mathrm{v}[\mathrm{x}]:=\mathrm{y}$
endfor


Fig. 13.11. Sequence reversal on a 3-cube.


### 13.5 Matrix Multiplication

Multiplying $\mathrm{m} \times \mathrm{m}$ matrices $(\mathrm{C}=\mathrm{A} \times \mathrm{B})$ on a q-cube,
where $m=2^{q / 3}$ and $p=m^{3}$

## Processor ( $0, j, k$ ) begins with $A_{j k}$ \& $B_{j k}$ in registers $R_{A}$ \& $R_{B}$ and ends with element $C_{j k}$ in register $R_{C}$

Multiplying $m \times m$ matrices on a $q$-cube, with $q=3 \log _{2} m$
for $\mathrm{I}=\mathrm{q} / 3-1$ downto 0 Processor $\mathrm{x}=\mathrm{ijk}, 0 \leq \mathrm{i}, \mathrm{j}, \mathrm{k}<\mathrm{m}$, do
if bit I of $i$ is 1
then get $y:=R_{A}\left[N_{I \pm 2 q / 3}(x)\right]$ and $z:=R_{B}\left[N_{I \pm 2 q / 3}(x)\right]$
set $R_{A}[x]:=y ; R_{B}[x]:=z$
endif
endfor
for $\mathrm{I}=\mathrm{q} / 3-1$ downto 0 Processor $\mathrm{x}=\mathrm{ijk}, 0 \leq \mathrm{i}, \mathrm{j}, \mathrm{k}<\mathrm{m}$, do if bit $I$ of $i$ and $k$ are different
then get $y:=R_{A}\left[N_{l}(x)\right]$
set $R_{A}[x]:=y$
endif
endfor
for $\mathrm{I}=\mathrm{q} / 3-1$ downto 0 Processor $\mathrm{x}=\mathrm{ijk}, 0 \leq \mathrm{i}, \mathrm{j}, \mathrm{k}<\mathrm{m}$, do if bit $I$ of $i$ and $j$ are different
then get $y:=R_{B}\left[N_{I \pm q / 3}(x)\right]$
set $R_{B}[x]:=y$
endif
endfor
Processor $x, 0 \leq x<p$, do $R_{C}:=R_{A} \times R_{B}$
$\left\{p=m^{3}=2 q\right.$ parallel multiplications in one step $\}$
for $\mathrm{I}=0$ to $\mathrm{q} / 3-1$ Processor $\mathrm{x}=\mathrm{ijk}, 0 \leq \mathrm{i}, \mathrm{j}, \mathrm{k}<\mathrm{m}$, do if bit I of $i$ is 0 then get $y:=R_{C}\left[N_{ \pm 2 q / 3}(x)\right]$
set $R_{C}[x]:=R_{C}[x]+y$
endif
endfor

$\left[\begin{array}{ll}1 & 2 \\ 3 & 4\end{array}\right] \times\left[\begin{array}{ll}5 & 6 \\ 7 & 8\end{array}\right]$





Fig. 13.12. Multiplying two $2 \times 2$ matrices on a 3-cube.

## Running time of the preceding algorithm: $\mathrm{O}(\mathrm{q})=\mathrm{O}(\log \mathrm{p})$

Analysis in the case of block matrix multiplication:
The $m \times m$ matrices are partitioned into $p^{1 / 3} \times p^{1 / 3}$ blocks of size $\left(m / p^{1 / 3}\right) \times\left(m / p^{1 / 3}\right)$
Each communication step involves $\mathrm{m}^{2} / \mathrm{p}^{2 / 3}$ block elements
Each multiplication involves $2 \mathrm{~m}^{3} / \mathrm{p}$ arithmetic operations

$$
T_{m u l}(m, p)=m^{2 /} / p^{2 / 3} \times O(\log p)+2 m^{3} / p
$$

Communication Computation

### 13.6 Inverting a Lower Triangular Matrix

For $A=\left[\begin{array}{ll}B & 0 \\ C & D\end{array}\right]$ we have $A^{-1}=\left[\begin{array}{cc}B^{-1} & 0 \\ -D^{-1} C B^{-1} & D^{-1}\end{array}\right]$
If $B$ and $D$ are inverted in parallel by independent subcubes, the algorithm's running time is characterized by:

$$
\begin{aligned}
\mathrm{T}_{\mathrm{inv}}(\mathrm{~m}) & =\mathrm{T}_{\mathrm{inv}}(\mathrm{~m} / 2)+2 \mathrm{~T}_{\mathrm{mul}}(\mathrm{~m} / 2) \\
& =\mathrm{T}_{\mathrm{inv}}(\mathrm{~m} / 2)+\mathrm{O}(\log \mathrm{~m})=\mathrm{O}\left(\log ^{2} \mathrm{~m}\right)
\end{aligned}
$$

## 14 Sorting and Routing on Hypercubes

Chapter Goals

- Present hypercube sorting algorithms, showing perfect fit to bitonic sorting
- Derive hypercube routing algorithms, utilizing elegant recursive methods
- Learn about inherent limitations of oblivious routing schemes


## Chapter Contents

- 14.1. Defining the Sorting Problem
- 14.2. Bitonic Sorting on a Hypercube
- 14.3. Routing Problems on a Hypercube
- 14.4. Dimension-Order Routing
- 14.5. Broadcasting on a Hypercube
- 14.6. Adaptive and Fault-Tolerant Routing
14.1. Defining the Sorting Problem

Arrange data in order of processor ID numbers (labels)


The ideal parallel sorting algorithm
$T(p)=\Theta((n \log n) / p)$
We cannot achieve this optimal running time for all $n$ and $p$
$1-1$ sorting ( $n=p$ )
Batcher's bitonic sort: $O\left(\log ^{2} n\right)=O\left(\log ^{2} p\right)$ time
Same for Batcher's odd-even merge sort
$\mathrm{O}(\log n)$-time deterministic algorithm not known
$k-k$ sorting ( $\mathrm{n}=\mathrm{pk}$ )
Optimal algorithms known for $n \gg p$ or when
average running time is considered (randomized)


Fig. 14.1. Examples of bitonic sequences.


Fig. 14.2. Sorting a bitonic sequence on a linear array.


Fig. 14.3. Sorting an arbitrary sequence on a linear array through recursive application of bitonic sorting.

$$
\begin{aligned}
T(p) & =T(p / 2)+B(p) \\
& =T(p / 2)+2 p-2=4 p-4-2 \log _{2} p
\end{aligned}
$$

Alternate derivation for the running time of bitonic sorting on a linear array:

$$
\begin{aligned}
T(p) & =B(2)+B(4)+\cdots+B(p) \\
& =2+6+14+\cdots+(2 p-2)=4 p-4-2 \log _{2} p
\end{aligned}
$$

For a linear array of processors, the bitonic sorting algorithm is clearly inferior to the simpler odd-even transposition sort which requires only p compare-exchange steps or $2 p$ unidirectional communication steps
However, the situation is quite different for a hypercube

### 14.2 Bitonic Sorting on a Hypercube

Sort lower $\left(x_{q-1}=0\right)$ and upper $\left(x_{q-1}=1\right)$ subcubes in opposite directions; yields a bitonic sequence Shifting the halves takes one compare-exchange step
$B(q)=B(q-1)+1=q$

```
Sorting a bitonic sequence of size n on q-cube, q= 知自
for I = q-1 downto 0 Processor x, 0\leqx<p, do
    if }\mp@subsup{x}{l}{}=
    then get y := v[N|}(\textrm{x})]\mathrm{ ; keep min}(\textrm{v}(\textrm{x}),\textrm{y})\mathrm{ ;
        send max(v(x), y) to Nl(x)
    endif
endfor
```

Bitonic sorting algorithm

$$
\begin{aligned}
T(q) & =T(q-1)+B(q)=T(q-1)+q \\
& =q(q+1) / 2=\log _{2} p\left(\log _{2} p+1\right) / 2
\end{aligned}
$$




Dimension 2

Dimension 1

Dimension 0

Fig. 14.4. Sorting a bitonic sequence of size 8 on the 3-cube.

### 14.3 Routing Problems on a Hypercube

Types of routing algorithms
Oblivious: path uniquely determined by node addresses
Nonoblivious or adaptive: the path taken by a message may also depend on other messages in the network
On-line: make the routing decisions on the fly as you route
Off-line: route selections are precomputed for each problem of interest and stored within nodes (routing tables)

Positive result for off-line routing on a p-node hypercube Any 1-1 routing problem with $p$ or fewer packets can be solved in $\mathrm{O}(\log p)$ steps, using an off-line algorithm
The off-line algorithm chooses routes in such a way that the route taken by one message does not significantly overlap or conflict with those of other messages

Negative result for oblivious routing on any network
Theorem 14.1: Let $G=(V, E)$ be a $p$-node, degree-d network. Any oblivious routing algorithm for routing $p$ packets in G needs $\Omega(\sqrt{\mathrm{p}} / \mathrm{d})$ worst-case time
For a hypercube: oblivious routing requires $\Omega(\sqrt{p} / \log p)$ time in the worst case (only slightly better than mesh)
In most instances, actual routing performance is much closer to the log-time best case than to the worst case.

## Proof Sketch for Theorem 14.1

Let $P_{u, v}$ be the unique path used for routing from $u$ to $v$
There are $p(p-1)$ paths for routing among all node pairs
These paths are predetermined and independent of other traffic within the network
Our strategy is to find $k$ node pairs $u_{i}, v_{i}(1 \leq i \leq k)$ such that
$u_{i} \neq u_{j}$ and $v_{i} \neq v_{j}$ for $i \neq j$, and
$P_{u_{i}, v_{i}}$ all pass through the same edge $e$
Since at most 2 packets can go through a link in each step, $\Omega(\mathrm{k})$ steps will be needed for some 1-1 routing problem The main part of the proof consists of showing that $k$ can be as large as $\sqrt{p} / \mathrm{d}$


### 14.4 Dimension-Order Routing

Route from node 01011011

$$
\text { to node } 11010110
$$

Dimensions that differ
Path: $\frac{01011011, ~ 11011011, ~ 11010011, ~}{11010111,1010110}$,

Unfolded hypercube (indirect cube, butterfly network) facilitates the discussion of routing algorithms
Dimension-order routing between nodes $i$ and $j$ in $a$ hypercube can be viewed as routing from node in column 0 (q) to node j in column q (0) of the butterfly


Fig. 14.5. Unfolded 3-cube or the 32-node butterfly network.

Self-routing in a butterfly
Fom node 3 to node 6: routing tag $=011 \oplus 110=101$ (this indicates the "cross-straight-cross" path)
From node 6 to node 1 : routing tag $110 \oplus 001=111$ (this represents a "cross-cross-cross" path)


Fig. 14.6. Example dimension-order routing paths.

The butterfly network cannot route all permutations without node or edge conflicts; e.g., any permutation involving the routes $(1,7)$ and $(0,3)$ leads to a conflict

The extent of conflicts depends on the routing problem
There exist "good" routing problems for which conflicts are non-existent or rare

There are also "bad" routing problems that lead to maximum conflicts and thus the worst-case running time predicted by Theorem 14.1


Fig. 14.7. Packing is a "good" routing problem for dimension-order routing on the hypercube.


Fig. 14.8. Bit-reversal permutation is a "bad" routing problem for dimension-order routing on the hypercube.

Message buffer needs of dimension-order routing
One may think that if we limit each node to a small, constant number of message buffers, then the above bound still holds, except that messages will be queued at several levels before reaching node 0

However, queuing the messages at multiple intermediate nodes may introduce additional delays that we have not accounted for, so that even the $\Theta(\sqrt{p})$ running time can no longer be guaranteed

Bad news: if each node of the hypercube is limited to $\mathrm{O}(1)$ buffers, there exist permutation routing problems that require $O(p)$ time; i.e., as bad as on a linear array!

Good news: the performance is usually much better; i.e., $\log _{2} p+o(\log p)$ for most permutations. Hence, the average running time of the dimension-order routing algorithm is very close to its best case and its message buffer requirements are quite modest.
Besides, if we anticipate any (near) worst-case routing pattern to occur in a given application, two options are available to us:

Compute the routing paths off-line and store in tables
Use randomized routing to convert the worst-case to average-case performance
The probabilistic analyses required to show the good average-case performance of dimension-order routing are quite complicated

Wormhole routing on a hypercube


Some of the preceding results are directly applicable here
Any good routing problem, yielding node- and edge-disjoint paths, will remain good for wormhole routing
In Fig. 14.7, e.g., the four worms carrying messages A, B, C, D, will move with no conflict among them. Each message is thus delivered to its destination in the shortest possible time, regardless of the length of the worms
For bad routing problems, on the other hand, wormhole routing aggravates the difficulties, since each message can now tie up a number of nodes and links
In the case of wormhole routing, one also needs to be concerned with deadlocks resulting from circular waiting of messages for one another
Dimension-order routing is always deadlock-free
With hot-potato or deflection routing, which is attractive for reducing the message buffering requirements within nodes, dimension orders are occasionally modified or more than one routing step along some dimensions may be allowed
Deadlock considerations in this case are similar to those of other adaptive routing schemes discussed in Section 14.6

### 14.5 Broadcasting on a Hypercube

## Simple "flooding" scheme with all-port communication

00000
00001, 00010, 00100, 01000, 10000
00011, 00101, 01001, 10001, 00110, 01010, 10010, 01100, 10100, 11000 Distance-2 nodes 00111, 01011, 10011, 01101, 10101, 11001, 01110, 10110, 11010, 11100 Distance-3 nodes

01111, 10111, 11011, 11101, 11110
11111

Source node
Neighbors of source

Distance-4 nodes
Distance-5 node

## Binomial broadcast tree with single-port communication



Fig. 14.9. The binomial broadcast tree for a 5-cube.



Pipelined binomial-tree scheme


Fig. 14.10. Three hypercube broadcasting schemes as performed on a 4-cube.

### 14.6 Adaptive and Fault-Tolerant Routing

Because there are up to q node-disjoint and edge-disjoint shortest paths between any node pairs in a q-cube, one can route messages around congested or failed nodes/links
A useful notion for designing adaptive wormhole routing algorithms is that of virtual communication networks


Subnetwork 0


Subnetwork 1

Fig. 14.11. Partitioning a 3-cube into subnetworks for deadlock-free routing.

Because each of the subnetworks in Fig. 14.11 is acyclic, any routing scheme that begins by using links in Subnet 0, at some point switches the routing path to Subnet 1, and from then on remains in Subnet 1, is deadlock-free
Fault diameter of $q$-cube is at most $q+1$ with $q-1$ or fewer faults and at most $q+2$ with $2 q-3$ or fewer faults [Lati93]


Figure for Problem 14.15.

## 15 Other Hypercubic Architectures

Chapter Goals

- Learn how the binary hypercube can be generalized to provide cost or performance benefits over the original version
- Derive algorithms for these architectures based on emulating a hypercube

Chapter Contents

- 15.1. Modified and Generalized Hypercubes
- 15.2. Butterfly and Permutation Networks
- 15.3. Plus-or-Minus-2i Network
- 15.4. The Cube-Connected Cycles Network
- 15.5. Shuffle and Shuffle-Exchange Networks
- 15.6. That's Not All Folks!


### 15.1 Modified and Generalized Hypercubes



Fig. 15.1. Deriving a twisted 3-cube by redirecting two links in a 4-cycle.


A diametral path in the 3-cube


Folded 3-cube

Fig. 15.2. Deriving a folded 3-cube by adding four diametral links.


Fig. 15.3. Folded 3-cube viewed as 3-cube with a redundant dimension.

A hypercube is a power or homogeneous product network
q -cube $=(\mathrm{O}-\mathrm{O})^{\mathrm{q}}$
q-cube $=$ qth power of $\mathrm{K}_{2}$
Generalized hypercube $=$ qth power of $K_{r}$ (node labels are radix-r numbers)
Example: radix-4 generalized hypercube
Node labels are radix-4 numbers
Node $x$ is connected to $y$ iff $x$ and $y$ differ in one digit Each node has r-1 dimension-k links

### 15.2 Butterfly and Permutation Networks



Fig. 15.4. Butterfly and wrapped butterfly networks.


Fig. 15.5. Butterfly network with permuted dimensions.

Fat trees eliminate the bisection bottleneck of an "skinny" tree by making the bandwidth of links correspondingly higher near the root


Fig. 15.6. Two representations of a fat tree.
One way of realizing a fat tree

Front view:


Fig. 15.7. Butterfly network redrawn as a fat tree.

## Butterfly as a multistage interconnection network



Fig. 15.8. Butterfly network used to connect modules that are on the same side.

Generalization of the butterfly network
High-radix or m-ary butterfly (built of $m \times m$ switches) Has $\mathrm{m}^{9}$ rows and $\mathrm{q}+1$ columns ( $q$ if wrapped)


Fig. 15.9. Benes` network formed from two back-toback butterflies.

## Benes` network can route any permutation

 (it is rearrangeable)

Fig. 15.10. Another example of a Beneš network.

### 15.3 Plus-or-Minus-2i ${ }^{\text {ºtwork }}$



Fig. 15.11. Two representations of the eight-node plus-or-minus-2 ${ }^{1}$ network.


Fig. 15.12. Augmented data manipulator network.
15.4 The Cube-Connected Cycles Network


Fig. 15.13. A wrapped butterfly (left) converted into cube-connected cycles.

How CCC was originally defined:


Fig. 15.14. Alternate derivation of CCC from a hypercube.

## Emulating normal hypercube algorithms on CCC



| Cycle ID $=\mathrm{x}$ | Proc ID $=\mathrm{y}$ |
| :---: | :---: |
| $2^{\mathrm{m}}$ bits | m bits |



Fig. 15.15. CCC emulating a normal hypercube algorithm.

### 15.5 Shuffle and Shuffle-Exchange Networks




Fig. 15.16. Shuffle, exchange, and shuffle-exchange connectivities.


Fig. 15.17. Alternate views of an eight-node shuffleexchange network.

# In a $2^{q}$-node shuffle network, node $x=x_{q-1} x_{q-2} \cdots x_{2} x_{1} x_{0}$ is connected to $x_{q-2} \cdots x_{2} x_{1} x_{0} x_{q-1}$ (cyclic left-shift of $x$ ) In the shuffle-exchange network, node $x$ is additionally connected to $x_{q-2} \cdots x_{2} x_{1} x_{0} \bar{x}_{q-1}$ <br> Routing in a shuffle-exchange network 

| Source |  |  | 01011011 |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: |
| Destination |  |  | 11010110 |  |  |
| Positions that differ |  |  |  |  |  |
| Route | $\underline{01011011}$ | Shuffle to | 10110110 | Exchange to | 10110111 |
|  | 10110111 | Shuffle to | 01101111 |  |  |
|  | $\underline{0} 1101111$ | Shuffle to | 11011110 |  |  |
|  | $\underline{11011110}$ | Shuffle to | 10111101 |  |  |
|  | 10111101 | Shuffle to | 01111011 | Exchange to | 01111010 |
|  | $\underline{0} 1111010$ | Shuffle to | 11110100 | Exchange to | 11110101 |
|  | 11110101 | Shuffle to | 11101011 |  |  |
|  | 11101011 | Shuffle to | 11010111 | Exchange to | 11010110 |

For ${ }^{2}$-node shuffle-exchange network: $\mathrm{D}=\mathrm{q}=\log _{2} \mathrm{p}, \mathrm{d}=4$

With shuffle and exchange links provided separately, as shown in Fig. 15.18, the diameter increases to $2 q-1$ and node degree reduces to 3


Fig. 15.18. Eight-node network with separate shuffle and exchange links.

Multistage shuffle-exchange network = butterfly network


Fig. 15.19. Multistage shuffle-exchange network (omega network) is the same as butterfly network.

### 15.6 That's Not All, Folks!

When $q$ is a power of 2 , the $2^{q} q$-node cube-connected cycles network derived from the q-cube, by replacing each node with a $q$-cycle, is a subgraph of the $\left(q+\log _{2} q\right)$-cube
Thus, CCC can be viewed as a pruned hypercube
Other pruning strategies are possible, leading to interesting tradeoffs


Fig. 15.20. Example of a pruned hypercube.

Mobius cube
Dimension-i neighbor of $x=x_{q-1} x_{q-2} \cdots x_{i+1} x_{i} \cdots x_{1} x_{0}$ is

$$
x_{q-1} x_{q-2} \cdots 0 \bar{x}_{i} \cdots x_{1} x_{0} \text { if } \quad x_{i+1}=0
$$

(as in the hypercube, $x_{i}$ is complemented)

$$
x_{q-1} x_{q-2} \cdots 1 \bar{x}_{i} \cdots \bar{x}_{1} \bar{x}_{0} \quad \text { if } \quad x_{i+1}=1
$$

( $\mathrm{x}_{\mathrm{i}}$ and all the bits to its right are complemented)
For dimension $q-1$, since there is no $x_{q}$, the neighbor can be defined in two ways,
leading to 0 - and 1 -Mobius cubes
A Mobius cube has a diameter of about $1 / 2$ and an average inter-node distance of about $2 / 3$ of that of a hypercube


Fig. 15.21. Two 8-node Mobius cubes.

## 16 A Sampler of Other Networks

Chapter Goals

- Study examples of composite or hybrid architectures
- Study examples of hierarchical or multilevel architectures
- Complete the picture of the sea of interconnection networks


## Chapter Contents

- 16.1. Performance Parameters for Networks
- 16.2. Star and Pancake Networks
- 16.3. Ring-Based Networks
- 16.4. Composite or Hybrid Networks
- 16.5. Hierarchical (Multilevel) Networks
- 16.6. Multistage Interconnection Networks


### 16.1 Performance Parameters for Networks



