# Introduction to Parallel Processing 

Algorithms and Architectures


Behrooz Parhami

Shared-Memory Parallelism

## Part II'



## About This Presentation

This presentation is intended to support the use of the textbook Introduction to Parallel Processing: Algorithms and Architectures (Plenum Press, 1999, ISBN 0-306-45970-1). It was prepared by the author in connection with teaching the graduate-level course ECE 254B: Advanced Computer Architecture: Parallel Processing, at the University of California, Santa Barbara. Instructors can use these slides in classroom teaching and for other educational purposes. Any other use is strictly prohibited. © Behrooz Parhami

| Edition | Released | Revised | Revised | Revised |
| :--- | :---: | :---: | :---: | :---: |
| First | Spring 2005 | Spring 2006 | Fall 2008 | Fall 2010 |
|  |  | Winter 2013 | Winter 2014 | Winter 2016 |
|  |  | Winter 2019 | Winter 2020 | Winter 2021 |

## II' Shared-Memory Parallelism

Shared memory is the most intuitive parallel user interface:

- Abstract SM (PRAM); ignores implementation issues
- Implementation w/o worsening the memory bottleneck
- Shared-memory models and their performance impact

> | Topics in This Part |  |
| :--- | :--- |
| Chapter 5 | PRAM and Basic Algorithms |
| Chapter 6A | More Shared-Memory Algorithms |
| Chapter 6B | Implementation of Shared Memory |
| Chapter 6C | Shared-Memory Abstractions |

## 5 PRAM and Basic Algorithms

PRAM, a natural extension of RAM (random-access machine):

- Present definitions of model and its various submodels
- Develop algorithms for key building-block computations

| Topics in This Chapter |  |
| :--- | :--- |
| 5.1 | PRAM Submodels and Assumptions |
| 5.2 | Data Broadcasting |
| 5.3 | Semigroup or Fan-in Computation |
| 5.4 | Parallel Prefix Computation |
| 5.5 | Ranking the Elements of a Linked List |
| 5.6 | Matrix Multiplication |

Topics in This Chapter
5.1 PRAM Submodels and Assumptions
5.2 Data Broadcasting
5.3 Semigroup or Fan-in Computation
5.4 Parallel Prefix Computation
5.5 Ranking the Elements of a Linked List
5.6 Matrix Multiplication

## Why Start with Shared Memory？

Study one extreme of parallel computation models：
－Abstract SM（PRAM）；ignores implementation issues
－This abstract model is either realized or emulated
－In the latter case，benefits are similar to those of HLLs

In Part II＂，we will study the other extreme case of models：
－Concrete circuit model；incorporates hardware details
－Allows explicit latency／area／energy trade－offs
－Facilitates theoretical studies of speed－up limits
Everything else falls between these two extremes

### 5.1 PRAM Submodels and Assumptions



Fig. 4.6 Conceptual view of a parallel random-access machine (PRAM).

Processor $i$ can do the following in three phases of one cycle:

1. Fetch a value from address $s_{i}$ in shared memory
2. Perform computations on data held in local registers
3. Store a value into address $d_{i}$ in shared memory

## Types of PRAM

## Reads from same location

Exclusive


Concurrent

## CREW

Default

Fig. 5.1 Submodels of the PRAM model.
Parallel Processing, Shared-Memory Parallelism

## Examples of Exclusive/Concurrent Reads/Writes

Exclusivity not enforced by hardware; rather, it's done by the programmer

Exclusive read:
for $0 \leq i<p$ processor $i$ read from location $i$
Exclusive write:
for $0 \leq i<p$ processor $i$ write into location $i+1 \bmod p$

Concurrent read:
for $0 \leq i<p$ processor $i$ read from location $i \bmod 2$
Concurrent write:
for $0 \leq i<p$ processor $i$ write into location $d_{i}$

## Types of CRCW PRAM

CRCW submodels are distinguished by the way they treat multiple writes:
Undefined: The value written is undefined (CRCW-U)
Detecting: A special code for "detected collision" is written (CRCW-D)
Common: Allowed only if they all store the same value (CRCW-C) [This is sometimes called the consistent-write submodel ]

Random: The value is randomly chosen from those offered (CRCW-R)
Priority: The processor with the lowest index succeeds (CRCW-P)
Max/Min: The largest/smallest of the values is written (CRCW-M)
Reduction: The arithmetic sum (CRCW-S), logical AND (CRCW-A), logical XOR (CRCW-X), or another combination of values is written

## Power of CRCW PRAM Submodels

Model U is more powerful than model V if $T_{\mathrm{U}}(n)=\mathrm{o}\left(T_{\mathrm{V}}(n)\right)$ for some problem

```
EREW < CREW < CRCW-D < CRCW-C < CRCW-R < CRCW-P
```

Theorem 5.1: A p-processor CRCW-P (priority) PRAM can be simulated (emulated) by a $p$-processor EREW PRAM with slowdown factor $\Theta(\log p)$.

Intuitive justification for concurrent read emulation (write is similar):

Write the $p$ memory addresses in a list
Sort the list in ascending order of addresses Remove all duplicate addresses
Access data at desired addresses
Replicate data via parallel prefix computation
Each step requires constant or $\mathrm{O}(\log p)$ time

| Proc Addr |  |  |
| :---: | :---: | :---: |
| 0,1 | 0,1 | 0,1 |
| 1,6 | 6,1 |  |
| 2,5 | 7,1 |  |
| 3,2 | 3,2 | 3,2 |
| 4,3 | 8,2 |  |
| 5,6 | 4,3 | 4,3 |
| 6,1 | 2,5 | 2,5 |
| 7,1 | 1,6 | 1,6 |
| 8,2 | 5,6 |  |

## Implications of the CRCW Hierarchy of Submodels

## EREW < CREW < CRCW-D < CRCW-C < CRCW-R < CRCW-P

A p-processor CRCW-P (priority) PRAM can be simulated (emulated) by a $p$-processor EREW PRAM with slowdown factor $\Theta(\log p)$.

Our most powerful PRAM CRCW submodel can be emulated by the least powerful submodel with logarithmic slowdown

Efficient parallel algorithms have polylogarithmic running times
Running time still polylogarithmic after slowdown due to emulation
We need not be too concerned with the CRCW submodel used
Simply use whichever submodel is most natural or convenient

## Some Elementary PRAM Computations

Initializing an $n$-vector (base address $=B$ ) to all 0 s :

$$
\begin{aligned}
& \text { for } j=0 \text { to }\lceil n / p\rceil-1 \text { processor } i \text { do } \\
& \qquad h=j p+i \\
& \quad \text { if } h<n \text { then } M[B+h]:=0 \\
& \text { endfor }
\end{aligned}
$$



Adding two $n$-vectors and storing the results in a third (base addresses $B^{\prime}, B^{\prime \prime}, B$ )

Convolution of two $n$-vectors: $W_{k}=\sum_{i+j=k} U_{i} \times V_{j}$
(base addresses $B_{W}, B_{U}, B_{V}$ )

### 5.2 Data Broadcasting

Making $p$ copies of $B[0]$ by recursive doubling for $k=0$ to $\left\lceil\log _{2} p\right\rceil-1$ Proc $j, 0 \leq j<p$, do Copy $B[J]$ into $B\left[j+2^{k}\right]$ endfor

Can modify the algorithm so that redundant copying does not occur and array bound is not exceeded


Fig. 5.2 Data broadcasting in EREW PRAM via recursive doubling.


Fig. 5.3 EREW PRAM data broadcasting without redundant copying.

## Class Participation: Broadcast-Based Sorting

Each person write down an arbitrary nonnegative integer with 3 or fewer digits on a piece of paper

Students take turn broadcasting their numbers by calling them out aloud

Each student puts an $X$ on paper for every number
 called out that is smaller than his/her own number, or is equal but was called out before the student's own value

Each student counts the number of $X$ s on paper to determine the rank of his/her number

Students call out their numbers in order of the computed rank

## All-to-All Broadcasting on EREW PRAM

EREW PRAM algorithm for all-to-all broadcasting
Processor $j, 0 \leq j<p$, write own data value into $B[j]$ for $k=1$ to $p-1$ Processor $j, 0 \leq j<p$, do

Read the data value in $B[j+k) \bmod p]$ endfor

This $O(p)$-step algorithm is time-optimal


Naive EREW PRAM sorting algorithm (using all-to-all broadcasting)
Processor $j, 0 \leq j<p$, write 0 into $R[j]$
for $k=1$ to $p-1$ Processor $j, 0 \leq j<p$, do
$I:=(j+k) \bmod p$
if $S[I]<S[j]$ or $S[I]=S[j]$ and $I<j$
then $R[j]:=R[j]+1$
endif
endfor

This $\mathrm{O}(p)$-step sorting algorithm is far from optimal; sorting is possible in $\mathrm{O}(\log p)$ time

Processor $j, 0 \leq j<p$, write $S[j]$ into $S[R[j]]$

### 5.3 Semigroup or Fan-in Computation

EREW PRAM semigroup computation algorithm
Proc $j, 0 \leq j<p$, copy $X[J]$ into $S[J]$ $s:=1$
while $s<p \operatorname{Proc} j, 0 \leq j<p-s$, do

$$
S[j+s]:=S[j] \otimes S[j+s]
$$

$$
s:=2 s
$$

endwhile
Broadcast $S[p-1]$ to all proc's


| $0: 0$ |
| :---: |
| $0: 1$ |
| $0: 2$ |
| $0: 3$ |
| $0: 4$ |
| $0: 5$ |
| $0: 6$ |
| $0: 7$ |
| $0: 8$ |
| $0: 9$ |

Fig. 5.4 Semigroup computation in EREW PRAM.

This algorithm is optimal for PRAM, but its speedup of $O(p / \log p)$ is not

If we use $p$ processors on a list of size $n=\mathrm{O}(p \log p)$, then optimal speedup can be achieved


Lower degree of parallelism near the root

Higher degre of parallelism near the leave

Fig. 5.5 Intuitive justification of why parallel slack helps improve the efficiency.

### 5.4 Parallel Prefix Computation

Same as the first part of semigroup computation (no final broadcasting)



Fig. 5.6 Parallel prefix computation in EREW PRAM via recursive doubling.

## A Divide-and-Conquer Parallel-Prefix Algorithm



Fig. 5.7 Parallel prefix computation using a divide-and-conquer scheme.

## Another Divide-and-Conquer Algorithm



$$
\begin{aligned}
& T(p)=T(p / 2)+1 \\
& T(p)=\log _{2} p
\end{aligned}
$$

Strictly optimal algorithm, but requires commutativity

Fig. 5.8 Another divide-and-conquer scheme for parallel prefix computation.

Slide 19

### 5.5 Ranking the Elements of a Linked List



Distance from head:
$\qquad$
Fig. 5.9 Example linked list and the ranks of its elements.

List ranking appears to be hopelessly sequential; one cannot get to a list element except through its predecessor!

Fig. 5.10 PRAM data structures representing a linked list and the ranking results.

rank


## List Ranking via Recursive Doubling



Fig. 5.11 Element ranks initially and after each of the three iterations.

## PRAM List Ranking Algorithm

PRAM list ranking algorithm (via pointer jumping) Processor $j, 0 \leq j<p$, do \{initialize the partial ranks\} if $n e x t[j]=j$
then $\operatorname{rank}[j]:=0$
else $\operatorname{rank}[j]:=1$
endif
while $\operatorname{rank}[n e x t[h e a d]] \neq 0$ Processor $j, 0 \leq j<p$, do $\operatorname{rank}[j]$ := $\operatorname{rank}[j]+\operatorname{rank[next[j]]}$ next[j]:= next[next[j]] endwhile

Question: Which PRAM submodel is implicit in this algorithm?

If we do not want to modify the original list, we simply make a copy of it first, in constant time

Answer: CREW

Answer: CREW

### 5.6 Matrix Multiplication

Sequential matrix multiplication for $i=0$ to $m-1$ do

$$
\text { for } j=0 \text { to } m-1 \text { do }
$$

$$
t:=0
$$

$$
\text { for } k=0 \text { to } m-1 \text { do }
$$

$$
t:=t+a_{i k} b_{k j}
$$

endfor

$$
c_{i j}:=t
$$



PRAM solution with $m^{3}$ processors: each processor does one multiplication (not very efficient)

$$
c_{i j}:=\sum_{k=0 \text { to } m-1} a_{i k} b_{k j}
$$



Parallel Processing, Shared-Memory Parallelism

## PRAM Matrix Multiplication with $m^{2}$ Processors

PRAM matrix multiplication using $m^{2}$ processors
$\operatorname{Proc}(i, j), 0 \leq i, j<m$, do begin
$t:=0$
for $k=0$ to $m-1$ do $t:=t+a_{i k} b_{k j}$
endfor
end $c_{i j}:=t$

Processors are numbered (i, j), instead of 0 to $m^{2}-1$

$$
\Theta(m) \text { steps: Time-optimal }
$$

CREW model is implicit


Fig. 5.12 PRAM matrix multiplication; $p=m^{2}$ processors.

## PRAM Matrix Multiplication with $m$ Processors

PRAM matrix multiplication using $m$ processors
for $j=0$ to $m-1$ Proc $i, 0 \leq i<m$, do
$t:=0$
for $k=0$ to $m-1$ do

$$
t:=t+a_{i k} b_{k j}
$$

endfor
$c_{i j}:=t$
endfor

$\Theta\left(m^{2}\right)$ steps: Time-optimal
CREW model is implicit
Because the order of multiplications is immaterial, accesses to $B$ can be skewed to allow the EREW model


Slide 25

## PRAM Matrix Multiplication with Fewer Processors

Algorithm is similar, except that each processor is in charge of computing $m / p$ rows of $C$
$\Theta\left(m^{3} / p\right)$ steps: Time-optimal
EREW model can be used

A drawback of all algorithms thus far is that only two arithmetic operations (one multiplication and one addition) are performed for each memory access.

This is particularly costly for NUMA shared-memory machines.


## More Efficient Matrix Multiplication (for NUMA)

## Partition the matrices into $p$ square blocks



One processor computes these elements of C that it holds in local memory


Block matrix multiplication follows the same algorithm as simple matrix multiplication.

Fig. 5.13 Partitioning the matrices for block matrix multiplication .

## Details of Block Matrix Multiplication



Fig. 5.14 How Processor ( $i, j$ ) operates on an element of $A$ and one block-row of $B$ to update one block-row of $C$.

## 6A More Shared-Memory Algorithms

Develop PRAM algorithm for more complex problems:

- Searching, selection, sorting, other nonnumerical tasks
- Must present background on the problem in some cases

| Topics in This Chapter |  |
| :--- | :--- |
| 6A. 1 | Parallel Searching Algorithms |
| 6A. 2 | Sequential Rank-Based Selection |
| 6A. 3 | A Parallel Selection Algorithm |
| 6A. 4 | A Selection-Based Sorting Algorithm |
| 6A. 5 | Alternative Sorting Algorithms |
| 6A. 6 | Convex Hull of a 2D Point Set |

## 6A. 1 Parallel Searching Algorithms

Searching an unordered list in PRAM
Sequential time: $n$ worst-case $n / 2$ on average

Divide the list of $n$ items into $p$ segments of $\lceil n / p\rceil$ items (last segment may have fewer)

Processor $i$ will be in charge of $\lceil n / p\rceil$ list elements, beginning at address $i\lceil n / p\rceil$

Parallel time:
$\lceil n / p\rceil$ worst-case ??? on average

Perfect speed-up of $p$ with $p$ processors?
Pre- and postprocessing overheads

Example: $n=24, p=4$


Slide 30

## Parallel ( $p+1$ )-ary Search on PRAM

$p$ probes, rather than 1, per step

$$
\begin{aligned}
& \log _{p+1}(n+1) \\
& \quad=\log _{2}(n+1) / \log _{2}(p+1) \\
& \quad=\Theta(\log n / \log p) \text { steps }
\end{aligned}
$$

Speedup $\cong \log p$
Optimal: no comparison-based search algorithm can be faster

A single search in a sorted list can't be significantly speeded up through parallel processing, but all hope is not lost:

Dynamic data (sorting overhead)
Batch searching (multiple lookups)

Example:
$n=26, p=2$


From
Sec. 8.1

## 6A. 2 Sequential Ranked-Based Selection

Selection: Find the (or a) $k$ th smallest among $n$ elements
Example: 5th smallest element in the following list is 1 :
$\begin{array}{llllllllllllllllllllllll}6 & 4 & 5 & 6 & 7 & 1 & 5 & 3 & 8 & 2 & 1 & 0 & 3 & 4 & 5 & 6 & 2 & 1 & 7 & 1 & 4 & 5 & 4 & 9\end{array}$

Naive solution through sorting, $\mathrm{O}(n \log n)$ time

But linear-time sequential algorithm can be developed


Slide 32

## Linear-Time Sequential Selection Algorithm

## Sequential rank-based selection algorithm select( $S, k$ )

1. if $|S|<q \quad\{q$ is a small constant $\}$
then sort $S$ and return the $k$ th smallest element of $S$
$O(n)$ else divide $S$ into $|S| / q$ subsequences of size $q$ Sort each subsequence and find its median Let the $|S| / q$ medians form the sequence $T$
endif
$T(n / q)$ 2. $m=\operatorname{select}(T,|T| / 2) \quad$ \{find the median $m$ of the $|S| / q$ medians\}
2. Create 3 subsequences


To be justified
$T(3 n / 4)$
$L$ : Elements of $S$ that are $<m$
$E$ : Elements of $S$ that are $=m$
G: Elements of $S$ that are $>m$
4. if $|L| \geq k$
then return $\operatorname{select}(L, k)$
else if $|L|+|E| \geq k$
then return $m$
else return $\operatorname{select}(G, k-|L|-|E|)$ endif
endif


## Algorithm Complexity and Examples

$$
T(n)=T(n / q)+T(3 n / 4)+c n
$$

We must have $q \geq 5$;
for $q=5$, the solution is $T(n)=20 \mathrm{cn}$


To find the 5th smallest element in $S$, select the 5th smallest element in $L$


The 9th smallest element of $S$ is 3.

The 13th smallest element of $S$ is found by selecting the 4th smallest element in $G$.

Answer: 1
Parallel Processing, Shared-Memory Parallelism


## 6A. 3 A Parallel Selection Algorithm

## Parallel rank-based selection algorithm $\operatorname{PRAMselect(S,k,p)}$

1. if $|S|<4$
then sort $S$ and return the $k$ th smallest element of $S$

## $\mathrm{O}\left(n^{x}\right)$

 else broadcast $|S|$ to all $p$ processorsdivide $S$ into $|S| / q$ subsequences $S(j)$ of size $q$
Processor $j, 0 \leq j<p$, compute $T_{j}:=\operatorname{select}(S(j),|S(j)| / 2)$
endif
$T\left(n^{1-x}, p\right)$ 2. $m=\operatorname{PRAMselect}(T,|T| / 2, p) \quad$ \{median of the medians $\}$
3. Broadcast $m$ to all processors and create 3 subsequences
$L$ : Elements of $S$ that are $<m$
$E$ : Elements of $S$ that are $=m$
$G$ : Elements of $S$ that are $>m$
4. if $|L| \geq k$
then return PRAMselect $(L, k, p)$ else if $|L|+|E| \geq k$
then return $m$
else return $\operatorname{PRAMselect(G,k-|L|-|E|,~p)~}$
endif
endif
Let $p=\mathrm{O}\left(n^{1-x}\right)$


## Algorithm Complexity and Efficiency

$T(n, p)=T\left(n^{1-x}, p\right)+T(3 n / 4, p)+c n^{x}$
The solution is $\mathrm{O}\left(n^{x}\right)$; verify by substitution

Speedup $=\Theta(n) / O\left(n^{x}\right)=\Omega\left(n^{1-x}\right)=\Omega(p)$
Efficiency $=\Omega(1)$

## Remember

$p=O\left(n^{1-x}\right)$

Work $(n, p)=p T(n, p)=\Theta\left(n^{1-x}\right) \Theta\left(n^{x}\right)=\Theta(n)$

What happens if we set $x$ to 1 ? (i.e., use one processor)

$$
T(n, 1)=O\left(n^{x}\right)=O(n)
$$

What happens if we set $\boldsymbol{x}$ to 0 ? (i.e., use $n$ processors)

$$
T(n, n)=O\left(n^{x}\right)=O(1) ?
$$

No, because in asymptotic analysis, we ignored several O(log $n$ ) terms compared with $\mathrm{O}\left(n^{x}\right)$ terms

## Data Movement in Step 2 of the Algorithm



Consider the sublist $L$ : Processor $i$ contributes $a_{i}$ items to this sublist Processor 0 starts storing at location 0 , processor 1 at location $a_{0}$, processor 2 at location $a_{0}+a_{1}$, Processor 3 at location $a_{0}+a_{1}+a_{2}, \ldots$

## 6A. 4 A Selection-Based Sorting Algorithm

$O(1)$
$O\left(n^{x}\right)$
$O\left(n^{x}\right)$

Parallel selection-based sort PRAMselectionsort( $S, p$ )

1. if $|S|<k$ then return quicksort(S)
2. for $i=1$ to $k-1$ do
$m_{j}:=P R A M \operatorname{select}(S, i|S| / k, p)\left\{\operatorname{let} m_{0}:=-\infty ; m_{k}:=+\infty\right\}$
endfor
3. for $i=0$ to $k-1$ do make the sublist $T(i)$ from elements of $S$ in $\left(m_{i}, m_{i+1}\right)$ endfor
4. for $i=1$ to $k / 2$ do in parallel PRAMselectionsort(T(i), 2plk) endfor
5. for $i=k / 2+1$ to $k$ do in parallel PRAMselectionsort(T(i), 2plk)
$T(n / k, 2 p / k)$ endfor

> Let $p=n^{1-x}$
> and $k=2^{1 / x}$


Fig. 6.1 Partitioning of the sorted list for selection-based sorting.

## Algorithm Complexity and Efficiency

$T(n, p)=2 T(n / k, 2 p / k)+c n^{x}$
The solution is $\mathrm{O}\left(n^{\times} \log n\right)$; verify by substitution

Speedup $(n, p)=\Omega(n \log n) / O\left(n^{\times} \log n\right)=\Omega\left(n^{1-x}\right)=\Omega(p)$
Efficiency $=$ speedup $/ p=\Omega(1)$
$\operatorname{Work}(n, p)=p T(n, p)=\Theta\left(n^{1-x}\right) \Theta\left(n^{\times} \log n\right)=\Theta(n \log n)$

What happens if we set $x$ to 1 ? (i.e., use one processor)

$$
T(n, 1)=O\left(n^{x} \log n\right)=O(n \log n)
$$

Remember
$p=O\left(n^{1-x}\right)$

Our asymptotic analysis is valid for $\boldsymbol{x}>\mathbf{0}$ but not for $\boldsymbol{x}=\mathbf{0}$;
i.e., PRAMselectionsort cannot sort $p$ keys in optimal $O(\log p)$ time.

## Example of Parallel Sorting

## 

Threshold values for $k=4$ (i.e., $x=1 / 2$ and $p=n^{1 / 2}$ processors):

$$
\begin{aligned}
& m_{0}=-\infty \\
& m_{1}=P R \\
& m_{2}=P R \\
& m_{3}=P R \\
& m_{4}=+\infty
\end{aligned}
$$

$$
n / k=25 / 4 \cong 6 \quad m_{1}=\operatorname{PRAMselect}(S, 6,5)=2
$$

$$
2 n / k=50 / 4 \cong 13 \quad m_{2}=\operatorname{PRAMselect}(S, 13,5)=4
$$

$$
3 n / k=75 / 4 \cong 19 \quad m_{3}=\operatorname{PRAMselect}(S, 19,5)=6
$$

$$
\begin{gathered}
m_{0} \\
T: \ldots
\end{gathered} m_{1} m_{2} \quad m_{3} \ldots m_{4}
$$

$T:$| 0 | 1 | 1 | 1 | 2 | 2 | 3 | 3 | 4 | 4 | 4 | 4 | 5 | 5 | 5 | 5 | 5 | 6 | 6 | 6 | 7 | 7 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |

## 6A. 5 Alternative Sorting Algorithms

Sorting via random sampling (assume $p \ll \sqrt{ } n$ )
Given a large list $S$ of inputs, a random sample of the elements can be used to find $k$ comparison thresholds
It is easier if we pick $k=p$, so that each of the resulting
subproblems is handled by a single processor
Parallel randomized sort $P R A M$ randomsort( $S, p$ )

1. Processor $j, 0 \leq j<p$, pick $|S| / p^{2}$ random samples of
its $\mid S / / p$ elements and store them in its corresponding
section of a list $T$ of length $|S| / p$
2. Processor 0 sort the list $T$
\{comparison threshold $m_{i}$ is the $\left(i|S| / p^{2}\right)$ th element of $T$ \}
3. Processor $j, 0 \leq j<p$, store its elements falling in $\left(m_{i}, m_{i+1}\right)$ into $T(i)$
4. Processor $j, 0 \leq j<p$, sort the sublist $T(j)$

## Parallel Binsort or Bucketsort

Suppose input values are in $[0, m$ )
Divide the range into $p$ subranges
$[0, m / p),[m / p, 2 m / p), \ldots$
Processor $j$ sorts the elements in the jth subrange
If values are uniformly distributed, each processor gets $\sim n / p$ values

Avg. time $=\frac{O(n / p)}{\begin{array}{c}\text { Fill } \\ \text { buckets }\end{array}}+\frac{O((n / p) \log (n / p)}{\begin{array}{c}\text { Sort } \\ \text { buckets }\end{array}}$


If the list consists of small range of values (say, numbers 0-9), bucketsort can be used to count the number of occurrences of each value in $O(n)$ time serially and in $O(n / p)$ time in parallel

## Parallel Radixsort

In binary version of radixsort, we examine every bit of the $k$-bit keys in turn, starting from the LSB
In Step $i$, bit $i$ is examined, $0 \leq i<k$
Records are stably sorted by the value of the ith key bit

| Binary forms | Input list | Sort by LSB | Sort by middle bit | Sort by MSB |
| :---: | :---: | :---: | :---: | :---: |
|  |  |  |  |  |
|  | 5 (101) | $4(100)$ | $\rightarrow 4(100)$ | $\begin{array}{r} 1(001) \\ 2(010) \end{array}$ |
|  | 7 (111) | $2(010)$ | ${ }^{7} 5$ (101) | $2(010)$ |
| Question: How are the data movements performed? | 3 (011) | $\underline{2}$ (010) | 1 1(001) | $\checkmark 2$ (010) |
|  | 1 (001) | 5 (101) | 2 (010) | -3 (011) |
|  | 4 (100) | 7 (111) | 2 (010) | 4 (100) |
|  | 2 (010) | 3 (011) | 7 (111) | 5 (101) |
|  | 7 (111) | 1 (001) | 3 (011) | 7 (111) |
|  | 2 (010) | 7 (111) | 7 (111) | 7 (111) |

## Data Movements in Parallel Radixsort

| Input list | Compl of bit 0 | Diminished prefix sums | Bit 0 | Prefix sums plus 2 | Shifted list |
| :---: | :---: | :---: | :---: | :---: | :---: |
| 5 (101) | 0 | - | 1 | $1+2=3$ | 4 (100) |
| 7 (111) | 0 | - | 1 | $2+2=4$ | 2 (010) |
| 3 (011) | 0 | - | 1 | $3+2=5$ | $\underline{2(010)}$ |
| 1 (001) | 0 | - | 1 | $4+2=6$ | 5 (101) |
| 4 (100) | 1 | 0 | 0 | - | 7 (111) |
| 2 (010) | 1 | 1 | 0 | - | 3 (011) |
| 7 (111) | 0 | - | 1 | $5+2=7$ | 1 (001) |
| 2 (010) | 1 | 2 | 0 | - | 7 (111) |

Running time consists mainly of the time to perform $2 k$ parallel prefix computations: $\mathrm{O}(\log p)$ for $k$ constant

## 6A. 6 Convex Hull of a 2D Point Set




## PRAM Convex Hull Algorithm

Parallel convex hull algorithm PRAMconvexhull( $S, p$ )

1. Sort point set by $x$ coordinates
2. Divide sorted list into $\sqrt{ } p$ subsets $Q^{(i)}$ of size $\sqrt{ } p, 0 \leq i<\sqrt{ } p$
3. Find convex hull of each subset $Q^{(i)}$ using $\sqrt{ } p$ processors
4. Merge $\sqrt{ } p$ convex hulls $\mathrm{CH}\left(Q^{(i)}\right)$ into overall hull $\mathrm{CH}(Q)$


Fig. 6.4 Multiway divide and conquer for the convex hull problem

## Merging of Partial Convex Hulls


(a) No point of $\mathrm{CH}(\mathrm{Q}(i))$ is on $\mathrm{CH}(\mathrm{Q})$

(b) Points of $\mathrm{CH}(\mathrm{Q}(i))$ from $A$ to $B$ are on $\mathrm{CH}(\mathrm{Q})$

Tangent lines are found through binary search in log time

Analysis:
$T(p, p)$
$=T\left(p^{1 / 2}, p^{1 / 2}\right)+c \log p$ $\cong 2 c \log p$

The initial sorting also takes $\mathrm{O}(\log p)$ time

Fig. 6.5 Finding points in a partial hull that belong to the combined hull.
Parallel Processing, Shared-Memory Parallelism

## 6B Implementation of Shared Memory

Main challenge: Easing the memory access bottleneck

- Providing low-latency, high-bandwidth paths to memory
- Reducing the need for access to nonlocal memory
- Reducing conflicts and sensitivity to memory latency

| Topics in This Chapter |  |
| :--- | :--- |
| 6B. 1 | Processor-Memory Interconnection |
| 6B. 2 | Multistage Interconnection Networks |
| 6B. 3 | Cache Coherence Protocols |
| 6B. 4 | Data Allocation for Conflict-Free Access |
| 6B. 5 | Distributed Shared Memory |
| 6B. 6 | Methods for Memory Latency Hiding |

## About the New Chapter 6B

This new chapter incorporates material from the following existing sections of the book:
6.6 Some Implementation Aspects
14.4 Dimension-Order Routing
15.3 Plus-or-Minus-2i Network
16.6 Multistage Interconnection Networks
17.2 Distributed Shared Memory
18.1 Data Access Problems and Caching
18.2 Cache Coherence Protocols
18.3 Multithreading and Latency Hiding
20.1 Coordination and Synchronization

Parallel Processing, Shared-Memory Parallelism

## Making PRAM Practical

PRAM needs a read and a write access to memory in every cycle
Even for a sequential computer, memory is tens to hundreds of times slower than arithmetic/logic operations; multiple processors accessing a shared memory only makes the situation worse

Shared access to a single large physical memory isn't scalable
Strategies and focal points for making PRAM practical

1. Make memory accesses faster and more efficient (pipelining)
2. Reduce the number of memory accesses (caching, reordering of accesses so as to do more computation per item fetched/stored)
3. Reduce synchronization, so that slow memory accesses for one computation do not slow down others (synch, memory consistency)
4. Distribute the memory and data to make most accesses local
5. Store data structures to reduce access conflicts (skewed storage)

Parallel Processing, Shared-Memory Parallelism


## 6B. 1 Processor-Memory Interconnection



Parallel I/O
Fig. 4.3 A parallel processor with global (shared) memory.

Parallel Processing, Shared-Memory Parallelism


## Processor-to-Memory Network



An $8 \times 8$ crossbar switch

Crossbar switches offer full permutation capability (they are nonblocking), but are complex and expensive: $\mathrm{O}\left(p^{2}\right)$

Even with a permutation network, full PRAM functionality is not realized: two processors cannot access different addresses in the same memory module

Practical processor-tomemory networks cannot realize all permutations (they are blocking)

## Bus-Based Interconnections

## Single-bus system:

Bandwidth bottleneck Bus loading limit Scalability: very poor Single failure point Conceptually simple Forced serialization


## Multiple-bus system:

Bandwidth improved Bus loading limit Scalability: poor More robust
More complex scheduling Simple serialization


## Back-of-the-Envelope Bus Bandwidth Calculation

Single-bus system:
Bus frequency: 0.5 GHz Data width: 256 b (32 B) Mem. Access: 2 bus cycles $(0.5 \mathrm{G}) / 2 \times 32=8 \mathrm{~GB} / \mathrm{s}$
Bus cycle $=2$ ns
 Memory cycle $=100 \mathrm{~ns}$
1 mem. cycle $=50$ bus cycles

Multiple-bus system:
Peak bandwidth multiplied by the number of buses (actual bandwidth is likely to be much less)


## Hierarchical Bus Interconnection



Fig. 4.9 Example of a hierarchical interconnection architecture.

Parallel Processing, Shared-Memory Parallelism
(1) iccernclll

## Removing the Processor-to-Memory Bottleneck



Fig. 4.4 A parallel processor with global memory and processor caches.
Parallel Processing, Shared-Memory Parallelism

## Why Data Caching Works

Hit rate $r$ (fraction of memory accesses satisfied by cache)
$C_{\text {eff }}=C_{\text {fast }}+(1-r) C_{\text {slow }}$
Cache parameters:
Size
Block length (line width)
Placement policy
Replacement policy
Write policy
Fig. 18.1 Data storage and access in a two-way set-associative cache.


## Benefits of Caching Formulated as Amdahl's Law

Hit rate $r$ (fraction of memory accesses satisfied by cache)

$$
\begin{aligned}
C_{\text {eff }} & =C_{\text {fast }}+(1-r) C_{\text {slow }} \\
S & =C_{\text {slow }} / C_{\text {eff }} \\
& =\frac{1}{(1-r)+C_{\text {fast }} / C_{\text {slow }}}
\end{aligned}
$$



This corresponds to the miss-rate fraction $1-r$ of accesses being unaffected and the hit-rate fraction $r$ (almost 1 ) being speeded up by a factor $C_{\text {slow }} / C_{\text {fast }}$

Generalized form of Amdahl's speedup formula:

$$
S=1 /\left(f_{1} / p_{1}+f_{2} / p_{2}+\ldots+f_{m} / p_{m}\right), \text { with } f_{1}+f_{2}+\ldots+f_{m}=1
$$

In this case, a fraction $1-r$ is slowed down by a factor $\left(C_{\text {slow }}+C_{\text {fast }}\right) / C_{\text {slow }}$, and a fraction $r$ is speeded up by a factor $C_{\text {slow }} / C_{\text {fast }}$

Parallel Processing, Shared-Memory Parallelism

## 6B. 2 Multistage Interconnection Networks



## Cray Y-MP's Interconnection Network



Fig. 21.6 The processor-to-memory interconnection network of Cray Y-MP.

Parallel Processing, Shared-Memory Parallelism
(1) iccillll

## Butterfly Processor-to-Memory Network



Fig. 6.9 Example of a multistage memory access network.

Two ways to use the butterfly:

- Edge switches $1 \times 2$ and $2 \times 1$
- All switches $2 \times 2$

Not a full permutation network (e.g., processor 0 cannot be connected to memory bank 2 alongside the two connections shown)

Is self-routing: i.e., the bank address determines the route

A request going to memory bank 3 (0011) is routed:
lower upper upper

Paralel Processing, Shared Memory Parallelism

## Butterfly as Multistage Interconnection Network



Fig. 6.9 Example of a multistage memory access network


Fig. 15.8 Butterfly network used to connect modules that are on the same side

Generalization of the butterfly network High-radix or $m$-ary butterfly, built of $m \times m$ switches
Has $m^{q}$ rows and $q+1$ columns ( $q$ if wrapped)

## Self-Routing on a Butterfly Network

Fig. 14.6 Example dimension-order routing paths.

Number of cross links taken = length of path in hypercube


From node 3 to 6 : routing tag $=011 \oplus 110=\overleftarrow{101}$ "cross-straight-cross" From node 3 to 5: routing tag $=011 \oplus 101=110$ "straight-cross-cross" From node 6 to 1: routing tag $=110 \oplus 001=111$ "cross-cross-cross"

Parallel Processing, Shared-Memory Parallelism
(1) iccelllll

## Butterfly Is Not a Permutation Network



Fig. 14.7 Packing is a "good" routing problem for dimensionorder routing on the hypercube.


Fig. 14.8 Bit-reversal permutation is a "bad" routing problem for dimensionorder routing on the hypercube.

## Structure of Butterfly Networks

Switching these two row pairs converts this to the original butterfly network. Changing the order of stages in a butterfly is thus equivalent to a relabeling of the rows (in this example, row xyz becomes row xzy)


Fig. 15.5 Butterfly network with permuted dimensions.


The 16 -row butterfly network.

## Beneš Network



Fig. 15.9 Beneš network formed from two back-to-back butterflies.

A $2^{9}$-row Beneš network:
Can route any $2^{q} \times 2^{q}$ permutation
It is "rearrangeable"

## Routing Paths in a Beneš Network

To which memory modules can we connect proc 4 without rearranging the other paths?

What about proc 6 ?


Fig. 15.10 Another example of a Beneš network.

## Augmented Data

 Manipulator NetworkData manipulator network was used in Goodyear MPP, an early SIMD parallel machine.
"Augmented" means that switches in a column are independent, as opposed to all being set to same state (simplified control).

Fig. 15.12 Augmented data manipulator network.


## Fat Trees



Fig. 15.6 Two representations of a fat tree.


Skinny tree?

Front view: Binary tree Binary tree


Fig. 15.7 Butterfly network redrawn as a fat tree.

## The Sea of Indirect Interconnection Networks

Numerous indirect or multistage interconnection networks (MINs) have been proposed for, or used in, parallel computers

They differ in topological, performance, robustness, and realizability attributes

We have already seen the butterfly, hierarchical bus, beneš, and ADM networks

Fig. 4.8 (modified) The sea of indirect interconnection networks.


## Self-Routing Permutation Networks

Do there exist self-routing permutation networks? (The butterfly network is self-routing, but it is not a permutation network)

Permutation routing through a MIN is the same problem as sorting


Fig. 16.14 Example of sorting on a binary radix sort network.

## Partial List of Important MINs

Augmented data manipulator (ADM): aka unfolded PM2I (Fig. 15.12)
Banyan: Any MIN with a unique path between any input and any output (e.g. butterfly)
Baseline: Butterfly network with nodes labeled differently
Beneš: Back-to-back butterfly networks, sharing one column (Figs. 15.9-10)
Bidelta: A MIN that is a delta network in either direction
Butterfly: aka unfolded hypercube (Figs. 6.9, 15.4-5)
Data manipulator: Same as ADM, but with switches in a column restricted to same state
Delta: Any MIN for which the outputs of each switch have distinct labels (say 0 and 1
for $2 \times 2$ switches) and path label, composed of concatenating switch output labels
leading from an input to an output depends only on the output
Flip: Reverse of the omega network (inputs $\times$ outputs)
Indirect cube: Same as butterfly or omega
Omega: Multi-stage shuffle-exchange network; isomorphic to butterfly (Fig. 15.19)
Permutation: Any MIN that can realize all permutations
Rearrangeable: Same as permutation network
Reverse baseline: Baseline network, with the roles of inputs and outputs interchanged

Parallel Processing, Shared-Memory Parallelism


## 6B. 3 Cache Coherence Protocols



Fig. 18.2 Various types of cached data blocks in a parallel processor with global memory and processor caches.

## Example: A Bus-Based Snoopy Protocol

## Each transition is labeled with the event that triggers it,

 followed by the action(s) that must be taken

Fig. 18.3 Finite-state control mechanism for a bus-based snoopy cache coherence protocol.

## Implementing a Snoopy Protocol

A second tags/state storage unit allows snooping to be done concurrently with normal cache operation

Getting all the implementation timing and details right is nontrivial


Fig. 27.7 of Parhami's Computer Architecture text.

Parallel Processing, Shared-Memory Parallelism

## Scalable (Distributed) Shared Memory



## Some Terminology:

NUMA
Nonuniform memory access (distributed shared memory)

UMA
Uniform memory access (global shared memory)

COMA
Cache-only memory arch

Fig. 4.5 A parallel processor with distributed memory.

Parallel Processing, Shared-Memory Parallelism

## Example: A Directory-Based Protocol

Write miss: Fetch data value, request invalidation, return data value, sharing set $=\{c\}$

Read miss: Return data value, sharing set $=$ sharing set $+\{\mathrm{c}\}$


Fig. 18.4 States and transitions for a directory entry in a directory-based coherence protocol ( $c$ denotes the cache sending the message).

## Implementing a Directory-Based Protocol



Sharing set implemented as a bit-vector (simple, but not scalable)

When there are many more nodes (caches) than the typical size of a sharing set, a list of sharing units may be maintained in the directory


The sharing set can be maintained as a distributed doubly linked list (will discuss in Section 18.6 in connection with the SCI standard)

## 6B. 4 Data Allocation for Conflict-Free Access

Try to store the data such that parallel accesses are to different banks
For many data structures, a compiler may perform the memory mapping

Column 2


Fig. 6.6 Matrix storage in column-major order to allow concurrent accesses to rows.

Each matrix column is stored in a different memory module (bank)

Accessing a column leads to conflicts

## Skewed Storage Format



Fig. 6.7 Skewed matrix storage for conflict-free accesses to rows and columns.

## A Unified Theory of Conflict-Free Access

| Vector <br> indices | 0 | 6 | 12 | 18 | 24 | 30 |
| :--- | ---: | ---: | :--- | :--- | :--- | :--- |
|  | 1 | 7 | 13 | 19 | 25 | 31 |
|  | 2 | 8 | 14 | 20 | 26 | 32 |
|  | 9 | 9 | 15 | 21 | 27 | 33 |
| 4 | 10 | 16 | 22 | 28 | 34 |  |
| 5 | 11 | 17 | 23 | 29 | 35 |  |

$A_{i j}$ is viewed as vector element $i+j m$

A qD array can be viewed as a vector, with "row"/"column" accesses associated with constant strides

Fig. 6.8 A $6 \times 6$ matrix viewed, in columnmajor order, as a 36-element vector.

Column: $\quad k, k+1, k+2, k+3, k+4, k+5$
Row:
Diagonal:

$$
k, k+m, k+2 m, k+3 m, k+4 m, k+5 m
$$

$$
k, k+m+1, k+2(m+1), k+3(m+1)
$$

$$
k+4(m+1), k+5(m+1)
$$

Antidiagonal: $\quad k, k+m-1, k+2(m-1), k+3(m-1)$, $k+4(m-1), k+5(m-1)$

Stride $=1$
Stride $=m$

Stride $=m+1$
Stride $=m-1$

## Linear Skewing Schemes

| Vector |  |  |  |  |  |  |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: |
| indices | 0 | 6 | 12 | 18 | 24 | 30 |
|  | 1 | 7 | 13 | 19 | 25 | 31 |
|  | 2 | 8 | 14 | 20 | 26 | 32 |
| 3 | 9 | 15 | 21 | 27 | 33 |  |
| 4 | 10 | 16 | 22 | 28 | 34 |  |
| 5 | 11 | 17 | 23 | 29 | 35 |  |

$A_{i j}$ is viewed as vector element $i+j m$

Place vector element $i$ in memory bank $a+b i \bmod B$ (word address within bank is irrelevant to conflict-free access; also, a can be set to 0)

Fig. 6.8 A $6 \times 6$ matrix viewed, in columnmajor order, as a 36-element vector.

With a linear skewing scheme, vector elements $k, k+s, k+2 s, \ldots$, $k+(B-1) s$ will be assigned to different memory banks iff $s b$ is relatively prime with respect to the number $B$ of memory banks.

A prime value for $B$ ensures this condition, but is not very practical.

## 6B. 5 Distributed Shared Memory



## Some Terminology:

NUMA
Nonuniform memory access (distributed shared memory)

UMA
Uniform memory access (global shared memory)

COMA
Cache-only memory arch

Fig. 4.5 A parallel processor with distributed memory.

Parallel Processing, Shared-Memory Parallelism

## Butterfly-Based Distributed Shared Memory

## Randomized emulation of the p-processor PRAM on $p$-node butterfly

Use hash function to map memory locations to modules $p$ locations $\rightarrow p$ modules, not necessarily distinct

With high probability, at most $\mathrm{O}(\log p)$ of the $p$ locations will be in modules located in the same row

Average slowdown $=\mathrm{O}(\log p)$


Fig. 17.2 Butterfly distributed-memory machine emulating the PRAM.

## PRAM Emulation with Butterfly MIN

Emulation of the $p$-processor PRAM on ( $p \log p$ )-node butterfly, with memory modules and processors connected to the two sides; $\mathrm{O}(\log p)$ avg. slowdown

Less efficient than Fig. 17.2, which uses a smaller butterfly

By using $p /(\log p)$ physical processors to emulate the p-processor PRAM, this new emulation scheme becomes quite efficient (pipeline the memory accesses of the $\log p$ virtual processors assigned to each physical processor)


Fig. 17.3 Distributed-memory machine, with a butterfly multistage interconnection network, emulating the PRAM.

Parallel Processing, Shared-Memory Parallelism

## Deterministic Shared-Memory Emulation

Deterministic emulation
of $p$-processor PRAM
on $p$-node butterfly

Store $\log _{2} m$ copies of each of the $m$ memory location contents

Time-stamp each updated value
A "write" is complete once a majority of copies are updated

A "read" is satisfied when a majority of copies are accessed and the one with latest time stamp is used

Why it works: A few congested links won't delay the operation


Parallel Processing, Shared-Memory Parallelism

## PRAM Emulation Using Information Dispersal

Instead of $(\log m)$-fold replication of data, divide each data element into $k$ pieces and encode the pieces using a redundancy factor of 3 , so that any $k / 3$ pieces suffice for reconstructing the original data


Fig. 17.4 Illustrating the information dispersal approach to PRAM emulation with lower data redundancy.

## 6B. 6 Methods for Memory Latency Hiding

By assumption, PRAM accesses memory locations right when they are needed, so processing must stall until data is fetched

## Not a smart strategy:

Memory access time = 100s times that of add time

Method 1: Predict accesses (prefetch) Method 2: Pipeline multiple accesses

Memory modules
m-1
Parallel $1 / O$ UCSB

Proc. 0's access request
Proc. 0's access response

## 6C Shared-Memory Abstractions

A precise memory view is needed for correct algorithm design

- Sequential consistency facilitates programming
- Less strict consistency models offer better performance

| Topics in This Chapter |  |
| :--- | :--- |
| 6C. 1 | Atomicity in Memory Access |
| 6C. 2 | Strict and Sequential Consistency |
| 6C. 3 | Processor Consistency |
| 6C. 4 | Weak or Synchronization Consistency |
| 6C. 5 | Other Memory Consistency Models |
| 6C. 6 | Transactional Memory |

## 6C. 1 Atomicity in Memory Access

Performance optimization and latency hiding often imply that memory accesses are interleaved and perhaps not serviced in the order issued


## Barrier Synchronization Overhead



Fig. 20.3 The performance benefit of less frequent synchronization.

## Synchronization via Message Passing

Task interdependence is often more complicated than the simple prerequisite structure thus far considered


Fig. 20.1 Automatic synchronization in message-passing systems.
Parallel Processing, Shared-Memory Parallelism

## Synchronization with Shared Memory

Accomplished by accessing specially designated shared control variables
The fetch-and-add instruction constitutes a useful atomic operation
If the current value of $x$ is $c$, fetch-and-add $(x, a)$ returns $c$ to the process and overwrites $x=c$ with the value $c+a$

A second process executing fetch-and-add( $x, b$ ) then gets the now current value $c+a$ and modifies it to $c+a+b$

Why atomicity of fetch-and-add is important: With ordinary instructions, the 3 steps of fetch-and-add for $A$ and $B$ may be interleaved as follows:

| Time step 1 | $\frac{\text { Process } A}{\text { read } x}$ |  | Process $B$ |
| :--- | :--- | :--- | :--- |
| Time step 2 |  |  | Comments <br> A's accumulator holds $c$ |
| Time step 3 | add $a$ | read $x$ |  |
| Time step 4 |  |  |  |
| Time step 5 accumulator holds $c$ |  |  |  |

Slide 93

## Barrier Synchronization: Implementations

Make each processor, in a designated set, wait at a barrier until all other processors have arrived at the corresponding points in their computations

Software implementation via fetch-and-add or similar instruction
Hardware implementation via an AND tree (raise flag, check AND result)
A problem with the AND-tree: If a processor can be randomly delayed between raising it flag and checking the tree output, some processors might cross the barrier and lower their flags before others have noticed the change in the AND tree output
Solution: Use two AND trees for alternating barrier points


Fig. 20.4 Example of hardware aid for fast barrier synchronization [Hoar96].

## 6C. 2 Strict and Sequential Consistency

A global notion of time does not exist: The speed of light is finite; therefore, we do not become aware of events instantaneously

Suppose a group of people decide to synchronize their watches
One person shouts: On the count of 3 , set your watches to 1:00 PM
Not everyone will hear the command at the same time

In physics, we learn that two observers do not see the same time

Furthermore, the speed of time passing varies for different observers

## Conclusion: A universal notion of time does not exist!



## Strict Consistency

With strict consistency, a read operation always returns the result of the latest write operation on that data object

Strict consistency is impossible to maintain in a distributed system which does not have a global clock

While clocks can be synchronized, there is always some error that causes trouble in near-simultaneous operations

Example: Three processes sharing variables 1-4 ( $r=$ read, $w=$ write)


Parallel Processing, Shared-Memory Parallelism


## Sequential Consistency

Sequential consistency (original def.): The result of any execution is the same as if processor operations were executed in some sequential order, and the operations of a particular processor appear in the sequence specified by the program it runs


A possible ordering | w 4 | w 1 | r 1 | r 5 | r 2 | r 2 | w 2 | r 3 | r 3 | w 3 | w 4 | w 3 | r 4 | w 2 | r 1 | r 3 | r 1 | r 2 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |

A possible ordering | w 4 | r 5 | r 2 | w 1 | r 2 | r 1 | w 2 | r 3 | w 3 | w 4 | w 3 | r 4 | r 1 | r 3 | r 3 | r 1 | r 2 | w 2 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |

Sequential consistency (new def.): Write operations on the same data object are seen in exactly the same order by all system nodes


## The Performance Penalty of Sequential Consistency

$$
\begin{aligned}
& \text { Initially } \\
& \mathbf{X}=\mathbf{Y}=0
\end{aligned}
$$

$$
\text { Exec } 1
$$

$$
X:=1
$$

R1 :=Y

$$
Y:=1
$$

R2 := X

| Thread 1 |
| :--- |
| $\mathrm{X}:=1$ |
| $\mathrm{R} 1:=\mathrm{Y}$ |

$\frac{\text { Exec } 2}{Y:=1}$
R2:=X
$X:=1$
R1:=Y

Thread 2
Y:=1
R2:=X
Exec 3
$X:=1$
Y:=1
R1:=Y
R2:=X

If a compiler reorders the seemingly independent statements in Thread 1, the desired semantics (R1 and R2 not being both 0 ) is compromised

Relaxed consistency (memory model): Ease the requirements on doing things in program order and/or write atomicity to gain performance

When maintaining order is absolutely necessary, we use synchronization primitives to enforce it

## 6C. 3 Processor Consistency

Processor consistency: Writes by the same processors are seen by all other processors as occurring in the same order; writes by different processors may appear in different order at various nodes
Example: Linear array in which changes in values propagate at the rate of one node per time step


If P 0 and P 4 perform two write operations on consecutive time steps, then this is how the processors will see them

| Step 1 | WA | -- | -- | -- | WX |
| :---: | :---: | :---: | :---: | :---: | :---: |
| Step 2 | WB | WA | -- | WX | WY |
| Step 3 | -- | WB | WA, WX | WY | -- |
| Step 4 | -- | WX | WB, WY | WA |  |
| Step 5 | WX | WY | -- | WB | WA |
| Step 6 | WY | -- | - | -- | WB |
| Winter 2021 | UCSB | Parallel Processing, Shared-Memory Paralelelism | Bitreflill | Side 99 |  |

## 6C. 4 Weak or Synchronization Consistency

Weak consistency: Memory accesses are divided into two categories:
(1) Ordinary data accesses
(2) Synchronizing accesses

Category-1 accesses can be reordered with no limitation
If ordering of two operations is to be maintained, the programmer must specify at least one of them as a synchronizing, or Category-2, access


Slide 100

## 6C.5 Other Memory Consistency Models

Release consistency: Relaxes synchronization consistency somewhat (1) A process can access a shared variable only if all of its previous acquires have completed successfully
(2) A process can perform a release operation only if all of its previous reads and writes have completed
(3) Acquire and release accesses must be sequentially consistent

For more on memory consistency models, see:
Adve, S. V. and K. Gharachorloo, "Shared Memory Consistency Models: A Tutorial," IEEE Computer, Dec. 1996.

Adve, S. V., H.-J. Boehm, "Memory Models: A Case for Rethinking Parallel Languages and Hardware," Communications of the ACM, Aug. 2010.

For general info on memory management, see:
Gaud, F. et al., "Challenges of Memory Management on Modern NUMA Systems," Communications of the ACM, Dec. 2015.

## 6C.6 Transactional Memory

TM systems typically provide atomic statements that allow the execution of a block of code as an all-or-nothing entity (much like a transaction)

Example of transaction: Transfer \$x from account A to Account B
(1) if $a \geq x$
then $\mathrm{a}:=\mathrm{a}-\mathrm{x}$
else return "insufficient funds"
(2) $b:=b+x$
(3) return "transfer successful"

TM allows a group of read \& write operations to be enclosed in a block, so that any changed values become observable to the rest of the system only upon the completion of the entire block


## Examples of Memory Transactions

A group of reads, writes, and intervening operations can be grouped into an atomic transaction

Example: If wa and wb are made part of the same memory transaction, every processor will see both changes or neither of them



## Implementations of Transactional Memory

Software: 2-7 times slower than sequential code [Laru08]
Hardware acceleration: Hardware assists for the most time-consuming parts of TM operations
e.g., maintenance and validation of read sets

Hardware implementation: All required bookkeeping operations are implemented directly in hardware
e.g., by modifying the L1 cache and the coherence protocol

For more information on transactional memory, see:
[Laru08] Larus, J. and C. Kozyrakis, "Transactional Memory," Communications of the ACM, Vol. 51, No. 7, pp. 80-88, July 2008.

Parallel Processing, Shared-Memory Parallelism
Slide 104

